MongoDB Change Streams and Real-time Event Processing: Advanced Microservices Architecture Patterns for Event-Driven Applications

Modern distributed applications require sophisticated event-driven architectures that can process real-time data changes, coordinate microservices communication, and maintain system consistency across complex distributed topologies. Traditional polling-based approaches to change detection introduce latency, resource waste, and scaling challenges that become increasingly problematic as application complexity and data volumes grow.

MongoDB Change Streams provide a powerful, efficient mechanism for building reactive applications that respond to data changes in real-time without the overhead and complexity of traditional change detection patterns. Unlike database triggers or polling-based solutions that require complex infrastructure and introduce performance bottlenecks, Change Streams offer a scalable, resumable, and ordered stream of change events that enables sophisticated event-driven architectures, microservices coordination, and real-time analytics.
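
Before examining why older approaches fall short, a minimal sketch of consuming a change stream with the Node.js driver illustrates the basic shape; the connection string, database, and collection names are placeholders, and the production-grade patterns are developed later in this article.

// Minimal change stream consumer (connection string and namespace are placeholders)
const { MongoClient } = require('mongodb');

async function watchOrderChanges() {
  const client = new MongoClient('mongodb://localhost:27017'); // requires a replica set or sharded cluster
  await client.connect();

  const orders = client.db('ecommerce_platform').collection('orders');

  // Surface only inserts and updates, and fetch the full document on updates
  const changeStream = orders.watch(
    [{ $match: { operationType: { $in: ['insert', 'update'] } } }],
    { fullDocument: 'updateLookup' }
  );

  // Each event carries a resume token in _id; persist it to continue after restarts
  for await (const change of changeStream) {
    console.log(change.operationType, change.documentKey._id, change.fullDocument?.status);
  }
}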

The Traditional Change Detection Challenge

Conventional change detection approaches suffer from significant limitations for real-time application requirements:

-- Traditional PostgreSQL change detection with LISTEN/NOTIFY - limited scalability and functionality

-- Basic trigger-based notification system
CREATE OR REPLACE FUNCTION notify_order_changes()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    PERFORM pg_notify('order_created', json_build_object(
      'operation', 'INSERT',
      'order_id', NEW.order_id,
      'user_id', NEW.user_id,
      'total_amount', NEW.total_amount,
      'timestamp', NOW()
    )::text);
    RETURN NEW;
  ELSIF TG_OP = 'UPDATE' THEN
    PERFORM pg_notify('order_updated', json_build_object(
      'operation', 'UPDATE',
      'order_id', NEW.order_id,
      'old_status', OLD.status,
      'new_status', NEW.status,
      'timestamp', NOW()
    )::text);
    RETURN NEW;
  ELSIF TG_OP = 'DELETE' THEN
    PERFORM pg_notify('order_deleted', json_build_object(
      'operation', 'DELETE',
      'order_id', OLD.order_id,
      'user_id', OLD.user_id,
      'timestamp', NOW()
    )::text);
    RETURN OLD;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Attach triggers to orders table
CREATE TRIGGER order_changes_trigger
  AFTER INSERT OR UPDATE OR DELETE ON orders
  FOR EACH ROW EXECUTE FUNCTION notify_order_changes();

// Client-side change listening with significant limitations
// Node.js example showing polling approach complexity

const { Client } = require('pg');
const EventEmitter = require('events');

class PostgreSQLChangeListener extends EventEmitter {
  constructor(connectionConfig) {
    super();
    this.connectionConfig = connectionConfig; // stored for reconnection attempts
    this.client = new Client(connectionConfig);
    this.isListening = false;
    this.reconnectAttempts = 0;
    this.maxReconnectAttempts = 5;
    this.lastProcessedId = null;

    // Complex connection management required
    this.setupErrorHandlers();
  }

  async startListening() {
    try {
      await this.client.connect();

      // Listen to specific channels
      await this.client.query('LISTEN order_created');
      await this.client.query('LISTEN order_updated');
      await this.client.query('LISTEN order_deleted');
      await this.client.query('LISTEN user_activity');

      this.isListening = true;
      console.log('Started listening for database changes...');

      // Handle incoming notifications
      this.client.on('notification', async (msg) => {
        try {
          const changeData = JSON.parse(msg.payload);
          await this.processChange(msg.channel, changeData);
        } catch (error) {
          console.error('Error processing notification:', error);
          this.emit('error', error);
        }
      });

      // Poll for missed changes during disconnection
      this.startMissedChangePolling();

    } catch (error) {
      console.error('Failed to start listening:', error);
      await this.handleReconnection();
    }
  }

  async processChange(channel, changeData) {
    console.log(`Processing ${channel} change:`, changeData);

    // Complex event processing logic
    switch (channel) {
      case 'order_created':
        await this.handleOrderCreated(changeData);
        break;
      case 'order_updated':
        await this.handleOrderUpdated(changeData);
        break;
      case 'order_deleted':
        await this.handleOrderDeleted(changeData);
        break;
      default:
        console.warn(`Unknown channel: ${channel}`);
    }

    // Update processing checkpoint
    this.lastProcessedId = changeData.order_id;
  }

  async handleOrderCreated(orderData) {
    // Microservice coordination complexity
    const coordinationTasks = [
      this.notifyInventoryService(orderData),
      this.notifyPaymentService(orderData),
      this.notifyShippingService(orderData),
      this.notifyAnalyticsService(orderData),
      this.updateCustomerProfile(orderData)
    ];

    try {
      await Promise.all(coordinationTasks);
      console.log(`Successfully coordinated order creation: ${orderData.order_id}`);
    } catch (error) {
      console.error('Coordination failed:', error);
      // Complex error handling and retry logic required
      await this.handleCoordinationFailure(orderData, error);
    }
  }

  async startMissedChangePolling() {
    // Polling fallback for missed changes during disconnection
    setInterval(async () => {
      if (!this.isListening) return;

      try {
        const query = `
          SELECT 
            o.order_id,
            o.user_id,
            o.status,
            o.total_amount,
            o.created_at,
            o.updated_at,
            'order' as entity_type,
            CASE 
              WHEN o.created_at > NOW() - INTERVAL '5 minutes' THEN 'created'
              WHEN o.updated_at > NOW() - INTERVAL '5 minutes' THEN 'updated'
            END as change_type
          FROM orders o
          WHERE (o.created_at > NOW() - INTERVAL '5 minutes' 
                 OR o.updated_at > NOW() - INTERVAL '5 minutes')
            AND o.order_id > $1
          ORDER BY o.order_id
          LIMIT 1000
        `;

        const result = await this.client.query(query, [this.lastProcessedId || 0]);

        for (const row of result.rows) {
          await this.processChange(`order_${row.change_type}`, row);
        }

      } catch (error) {
        console.error('Polling error:', error);
      }
    }, 30000); // Poll every 30 seconds
  }

  async handleReconnection() {
    if (this.reconnectAttempts >= this.maxReconnectAttempts) {
      console.error('Max reconnection attempts reached');
      this.emit('fatal_error', new Error('Connection permanently lost'));
      return;
    }

    this.reconnectAttempts++;
    const delay = Math.pow(2, this.reconnectAttempts) * 1000; // Exponential backoff

    console.log(`Attempting reconnection ${this.reconnectAttempts}/${this.maxReconnectAttempts} in ${delay}ms`);

    setTimeout(async () => {
      try {
        await this.client.end();
        this.client = new Client(this.connectionConfig);
        this.setupErrorHandlers();
        await this.startListening();
        this.reconnectAttempts = 0;
      } catch (error) {
        console.error('Reconnection failed:', error);
        await this.handleReconnection();
      }
    }, delay);
  }

  setupErrorHandlers() {
    this.client.on('error', async (error) => {
      console.error('PostgreSQL connection error:', error);
      this.isListening = false;
      await this.handleReconnection();
    });

    this.client.on('end', () => {
      console.log('PostgreSQL connection ended');
      this.isListening = false;
    });
  }
}

// Problems with traditional PostgreSQL LISTEN/NOTIFY approach:
// 1. Limited payload size (8000 bytes) restricts change data detail
// 2. No guaranteed delivery - notifications lost during disconnection
// 3. No ordering guarantees across multiple channels
// 4. Complex reconnection and missed change handling logic required
// 5. Limited filtering capabilities - all listeners receive all notifications
// 6. No built-in support for change resumption from specific points
// 7. Scalability limitations with many concurrent listeners
// 8. Manual coordination required for microservices communication
// 9. Complex error handling and retry mechanisms needed
// 10. No native support for document-level change tracking

-- MySQL limitations are even more restrictive
-- MySQL basic replication events (limited functionality)
SHOW MASTER STATUS;
SHOW SLAVE STATUS;

-- MySQL binary log parsing (complex and fragile)
-- Requires external tools like Maxwell or Debezium
-- Limited change event structure and filtering
-- Complex setup and operational overhead
-- No native application-level change streams
-- Poor support for real-time event processing

MongoDB Change Streams provide comprehensive real-time change processing:

// MongoDB Change Streams - comprehensive real-time event processing with advanced patterns
const { MongoClient } = require('mongodb');
const EventEmitter = require('events');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce_platform');

// Advanced MongoDB Change Streams manager for microservices architecture
class MongoChangeStreamManager extends EventEmitter {
  constructor(db) {
    super();
    this.db = db;
    this.collections = {
      orders: db.collection('orders'),
      users: db.collection('users'),
      products: db.collection('products'),
      inventory: db.collection('inventory'),
      payments: db.collection('payments'),
      analytics: db.collection('analytics'),
      engagement_queue: db.collection('engagement_queue')
    };

    this.changeStreams = new Map();
    this.eventProcessors = new Map();
    this.resumeTokens = new Map();
    this.processingStats = new Map();

    // Advanced configuration for production use
    this.streamConfig = {
      batchSize: 100,
      maxAwaitTimeMS: 1000,
      fullDocument: 'updateLookup',
      fullDocumentBeforeChange: 'whenAvailable',
      startAtOperationTime: null,
      resumeAfter: null
    };

    // Event processing pipeline
    this.eventQueue = [];
    this.isProcessing = false;
    this.maxQueueSize = 10000;

    this.setupEventProcessors();
  }

  async initializeChangeStreams(streamConfigurations) {
    console.log('Initializing MongoDB Change Streams for microservices architecture...');

    for (const [streamName, config] of Object.entries(streamConfigurations)) {
      try {
        console.log(`Setting up change stream: ${streamName}`);
        await this.createChangeStream(streamName, config);
      } catch (error) {
        console.error(`Failed to create change stream ${streamName}:`, error);
        this.emit('stream_error', { streamName, error });
      }
    }

    // Start event processing
    this.startEventProcessing();

    console.log(`${this.changeStreams.size} change streams initialized successfully`);
    return this.getStreamStatus();
  }

  async createChangeStream(streamName, config) {
    const {
      collection,
      pipeline = [],
      options = {},
      processor,
      resumeToken = null
    } = config;

    // Build comprehensive change stream pipeline
    const changeStreamPipeline = [
      // Stage 1: Filter by operation types if specified
      ...(config.operationTypes ? [
        { $match: { operationType: { $in: config.operationTypes } } }
      ] : []),

      // Stage 2: Document-level filtering
      ...(config.documentFilter ? [
        { $match: config.documentFilter }
      ] : []),

      // Stage 3: Field-level filtering for efficiency
      ...(config.fieldFilter ? [
        { $project: config.fieldFilter }
      ] : []),

      // Custom pipeline stages
      ...pipeline
    ];

    const streamOptions = {
      ...this.streamConfig,
      ...options,
      ...(resumeToken && { resumeAfter: resumeToken })
    };

    const targetCollection = this.collections[collection] || this.db.collection(collection);
    const changeStream = targetCollection.watch(changeStreamPipeline, streamOptions);

    // Configure change stream event handlers
    this.setupChangeStreamHandlers(streamName, changeStream, processor);

    this.changeStreams.set(streamName, {
      stream: changeStream,
      collection: collection,
      processor: processor,
      config: config,
      stats: {
        eventsProcessed: 0,
        errors: 0,
        lastEventTime: null,
        startTime: new Date()
      }
    });

    console.log(`Change stream '${streamName}' created for collection '${collection}'`);
    return changeStream;
  }

  setupChangeStreamHandlers(streamName, changeStream, processor) {
    changeStream.on('change', async (changeDoc) => {
      try {
        // Extract resume token for fault tolerance
        this.resumeTokens.set(streamName, changeDoc._id);

        // Add comprehensive change metadata
        const enhancedChange = {
          ...changeDoc,
          streamName: streamName,
          receivedAt: new Date(),
          processingMetadata: {
            retryCount: 0,
            priority: this.calculateEventPriority(changeDoc),
            correlationId: this.generateCorrelationId(changeDoc),
            traceId: this.generateTraceId()
          }
        };

        // Queue for processing
        await this.queueChangeEvent(enhancedChange, processor);

        // Update statistics
        this.updateStreamStats(streamName, 'event_received');

      } catch (error) {
        console.error(`Error handling change in stream ${streamName}:`, error);
        this.updateStreamStats(streamName, 'error');
        this.emit('change_error', { streamName, error, changeDoc });
      }
    });

    changeStream.on('error', async (error) => {
      console.error(`Change stream ${streamName} error:`, error);
      this.updateStreamStats(streamName, 'stream_error');

      // Attempt to resume from last known position
      if (error.code === 40585 || error.code === 136) { // Resume token expired or invalid
        console.log(`Attempting to resume change stream ${streamName}...`);
        await this.resumeChangeStream(streamName);
      } else {
        this.emit('stream_error', { streamName, error });
      }
    });

    changeStream.on('close', () => {
      console.log(`Change stream ${streamName} closed`);
      this.emit('stream_closed', { streamName });
    });
  }

  async queueChangeEvent(changeEvent, processor) {
    // Prevent queue overflow
    if (this.eventQueue.length >= this.maxQueueSize) {
      console.warn('Event queue at capacity, dropping lowest-priority events');
      this.eventQueue.splice(-Math.floor(this.maxQueueSize * 0.1)); // Queue is sorted by priority (descending), so the tail holds the lowest-priority 10%
    }

    // Add event to processing queue with priority ordering
    this.eventQueue.push({ changeEvent, processor });
    this.eventQueue.sort((a, b) => 
      b.changeEvent.processingMetadata.priority - a.changeEvent.processingMetadata.priority
    );

    // Start processing if not already running
    if (!this.isProcessing) {
      setImmediate(() => this.processEventQueue());
    }
  }

  async processEventQueue() {
    if (this.isProcessing || this.eventQueue.length === 0) return;

    this.isProcessing = true;

    try {
      while (this.eventQueue.length > 0) {
        const { changeEvent, processor } = this.eventQueue.shift();

        try {
          const startTime = Date.now();
          await this.processChangeEvent(changeEvent, processor);
          const processingTime = Date.now() - startTime;

          // Update processing metrics
          this.updateProcessingMetrics(changeEvent.streamName, processingTime, true);

        } catch (error) {
          console.error('Event processing failed:', error);

          // Implement retry logic
          if (changeEvent.processingMetadata.retryCount < 3) {
            changeEvent.processingMetadata.retryCount++;
            changeEvent.processingMetadata.priority -= 1; // Lower priority for retries
            this.eventQueue.unshift({ changeEvent, processor });
          } else {
            console.error('Max retries reached for event:', changeEvent._id);
            this.emit('event_failed', { changeEvent, error });
          }

          this.updateProcessingMetrics(changeEvent.streamName, 0, false);
        }
      }
    } finally {
      this.isProcessing = false;
    }
  }

  async processChangeEvent(changeEvent, processor) {
    const { operationType, fullDocument, documentKey, updateDescription } = changeEvent;

    console.log(`Processing ${operationType} event for ${changeEvent.streamName}`);

    // Execute processor function with comprehensive context
    const processingContext = {
      operation: operationType,
      document: fullDocument,
      documentKey: documentKey,
      updateDescription: updateDescription,
      fullDocumentBeforeChange: changeEvent.fullDocumentBeforeChange,
      timestamp: changeEvent.clusterTime,
      metadata: changeEvent.processingMetadata,

      // Utility functions
      isInsert: () => operationType === 'insert',
      isUpdate: () => operationType === 'update',
      isDelete: () => operationType === 'delete',
      isReplace: () => operationType === 'replace',

      // Field change utilities
      hasFieldChanged: (fieldName) => {
        return updateDescription?.updatedFields?.hasOwnProperty(fieldName) ||
               updateDescription?.removedFields?.includes(fieldName);
      },

      getFieldChange: (fieldName) => {
        return updateDescription?.updatedFields?.[fieldName];
      },

      // Document utilities
      getDocumentId: () => documentKey._id,
      getFullDocument: () => fullDocument
    };

    // Execute the processor
    await processor(processingContext);
  }

  setupEventProcessors() {
    // Order lifecycle management processor
    this.eventProcessors.set('orderLifecycle', async (context) => {
      const { operation, document, hasFieldChanged } = context;

      switch (operation) {
        case 'insert':
          await this.handleOrderCreated(document);
          break;

        case 'update':
          if (hasFieldChanged('status')) {
            await this.handleOrderStatusChange(document, context.getFieldChange('status'));
          }
          if (hasFieldChanged('payment_status')) {
            await this.handlePaymentStatusChange(document, context.getFieldChange('payment_status'));
          }
          if (hasFieldChanged('shipping_status')) {
            await this.handleShippingStatusChange(document, context.getFieldChange('shipping_status'));
          }
          break;

        case 'delete':
          await this.handleOrderCancelled(context.getDocumentId());
          break;
      }
    });

    // Inventory management processor
    this.eventProcessors.set('inventorySync', async (context) => {
      const { operation, document, hasFieldChanged } = context;

      if (operation === 'insert' && document.items) {
        // New order - reserve inventory
        await this.reserveInventoryForOrder(document);
      } else if (operation === 'update' && hasFieldChanged('status')) {
        const newStatus = context.getFieldChange('status');

        if (newStatus === 'cancelled') {
          await this.releaseInventoryReservation(document);
        } else if (newStatus === 'shipped') {
          await this.confirmInventoryConsumption(document);
        }
      }
    });

    // Real-time analytics processor
    this.eventProcessors.set('realTimeAnalytics', async (context) => {
      const { operation, document, timestamp } = context;

      // Update real-time metrics
      const analyticsEvent = {
        eventType: `order_${operation}`,
        timestamp: timestamp,
        data: {
          orderId: context.getDocumentId(),
          customerId: document?.user_id,
          amount: document?.total_amount,
          region: document?.shipping_address?.region,
          products: document?.items?.map(item => item.product_id)
        }
      };

      await this.updateRealTimeMetrics(analyticsEvent);
    });

    // Customer engagement processor
    this.eventProcessors.set('customerEngagement', async (context) => {
      const { operation, document, hasFieldChanged } = context;

      if (operation === 'insert') {
        // New order - update customer profile
        await this.updateCustomerOrderHistory(document.user_id, document);

        // Trigger post-purchase engagement
        await this.triggerPostPurchaseEngagement(document);

      } else if (operation === 'update' && hasFieldChanged('status')) {
        const newStatus = context.getFieldChange('status');

        if (newStatus === 'delivered') {
          // Order delivered - trigger review request
          await this.triggerReviewRequest(document);
        }
      }
    });
  }

  async handleOrderCreated(orderDocument) {
    console.log(`Processing new order: ${orderDocument._id}`);

    // Coordinate microservices for order creation
    const coordinationTasks = [
      this.notifyPaymentService({
        action: 'process_payment',
        orderId: orderDocument._id,
        amount: orderDocument.total_amount,
        paymentMethod: orderDocument.payment_method
      }),

      this.notifyInventoryService({
        action: 'reserve_inventory',
        orderId: orderDocument._id,
        items: orderDocument.items
      }),

      this.notifyShippingService({
        action: 'calculate_shipping',
        orderId: orderDocument._id,
        shippingAddress: orderDocument.shipping_address,
        items: orderDocument.items
      }),

      this.notifyCustomerService({
        action: 'order_confirmation',
        orderId: orderDocument._id,
        customerId: orderDocument.user_id
      })
    ];

    // Execute coordination with error handling
    const results = await Promise.allSettled(coordinationTasks);

    // Check for coordination failures
    const failures = results.filter(result => result.status === 'rejected');
    if (failures.length > 0) {
      console.error(`Order coordination failures for ${orderDocument._id}:`, failures);

      // Trigger compensation workflow
      await this.triggerCompensationWorkflow(orderDocument._id, failures);
    }
  }

  async handleOrderStatusChange(orderDocument, newStatus) {
    console.log(`Order ${orderDocument._id} status changed to: ${newStatus}`);

    const statusHandlers = {
      'confirmed': async () => {
        await this.notifyFulfillmentService({
          action: 'prepare_order',
          orderId: orderDocument._id
        });
      },

      'shipped': async () => {
        await this.notifyCustomerService({
          action: 'shipping_notification',
          orderId: orderDocument._id,
          trackingNumber: orderDocument.tracking_number
        });

        // Update inventory
        await this.confirmInventoryConsumption(orderDocument);
      },

      'delivered': async () => {
        // Trigger post-delivery workflows
        await Promise.all([
          this.triggerReviewRequest(orderDocument),
          this.updateCustomerLoyaltyPoints(orderDocument),
          this.analyzeReorderProbability(orderDocument)
        ]);
      },

      'cancelled': async () => {
        // Execute cancellation compensation
        await this.executeOrderCancellation(orderDocument);
      }
    };

    const handler = statusHandlers[newStatus];
    if (handler) {
      await handler();
    }
  }

  async reserveInventoryForOrder(orderDocument) {
    console.log(`Reserving inventory for order: ${orderDocument._id}`);

    const inventoryOperations = orderDocument.items.map(item => ({
      updateOne: {
        filter: {
          product_id: item.product_id,
          available_quantity: { $gte: item.quantity }
        },
        update: {
          $inc: {
            available_quantity: -item.quantity,
            reserved_quantity: item.quantity
          },
          $push: {
            reservations: {
              order_id: orderDocument._id,
              quantity: item.quantity,
              reserved_at: new Date(),
              expires_at: new Date(Date.now() + 30 * 60 * 1000) // 30 minutes
            }
          }
        }
      }
    }));

    try {
      const result = await this.collections.inventory.bulkWrite(inventoryOperations);
      console.log(`Inventory reserved for ${result.modifiedCount} items`);

      // Check for insufficient inventory
      if (result.modifiedCount < orderDocument.items.length) {
        await this.handleInsufficientInventory(orderDocument, result);
      }

    } catch (error) {
      console.error(`Inventory reservation failed for order ${orderDocument._id}:`, error);
      throw error;
    }
  }

  async updateRealTimeMetrics(analyticsEvent) {
    console.log(`Updating real-time metrics for: ${analyticsEvent.eventType}`);

    const metricsUpdate = {
      $inc: {
        [`hourly_metrics.${new Date().getHours()}.${analyticsEvent.eventType}`]: 1
      },
      $push: {
        recent_events: {
          $each: [analyticsEvent],
          $slice: -1000 // Keep last 1000 events
        }
      },
      $set: {
        last_updated: new Date()
      }
    };

    // Update regional metrics
    if (analyticsEvent.data.region) {
      metricsUpdate.$inc[`regional_metrics.${analyticsEvent.data.region}.${analyticsEvent.eventType}`] = 1;
    }

    await this.collections.analytics.updateOne(
      { _id: 'real_time_metrics' },
      metricsUpdate,
      { upsert: true }
    );
  }

  async triggerPostPurchaseEngagement(orderDocument) {
    console.log(`Triggering post-purchase engagement for order: ${orderDocument._id}`);

    // Schedule engagement activities
    const engagementTasks = [
      {
        type: 'order_confirmation_email',
        scheduledFor: new Date(Date.now() + 5 * 60 * 1000), // 5 minutes
        recipient: orderDocument.user_id,
        data: { orderId: orderDocument._id }
      },
      {
        type: 'shipping_updates_subscription',
        scheduledFor: new Date(Date.now() + 60 * 60 * 1000), // 1 hour
        recipient: orderDocument.user_id,
        data: { orderId: orderDocument._id }
      },
      {
        type: 'product_recommendations',
        scheduledFor: new Date(Date.now() + 24 * 60 * 60 * 1000), // 24 hours
        recipient: orderDocument.user_id,
        data: { 
          orderId: orderDocument._id,
          purchasedProducts: orderDocument.items.map(item => item.product_id)
        }
      }
    ];

    await this.collections.engagement_queue.insertMany(engagementTasks);
  }

  // Microservice communication methods
  async notifyPaymentService(message) {
    // In production, this would use message queues (RabbitMQ, Apache Kafka, etc.)
    console.log('Notifying Payment Service:', message);

    // Simulate service call
    return new Promise((resolve) => {
      setTimeout(() => {
        console.log(`Payment service processed: ${message.action}`);
        resolve({ status: 'success', processedAt: new Date() });
      }, 100);
    });
  }

  async notifyInventoryService(message) {
    console.log('Notifying Inventory Service:', message);

    return new Promise((resolve) => {
      setTimeout(() => {
        console.log(`Inventory service processed: ${message.action}`);
        resolve({ status: 'success', processedAt: new Date() });
      }, 150);
    });
  }

  async notifyShippingService(message) {
    console.log('Notifying Shipping Service:', message);

    return new Promise((resolve) => {
      setTimeout(() => {
        console.log(`Shipping service processed: ${message.action}`);
        resolve({ status: 'success', processedAt: new Date() });
      }, 200);
    });
  }

  async notifyCustomerService(message) {
    console.log('Notifying Customer Service:', message);

    return new Promise((resolve) => {
      setTimeout(() => {
        console.log(`Customer service processed: ${message.action}`);
        resolve({ status: 'success', processedAt: new Date() });
      }, 75);
    });
  }

  // Utility methods
  calculateEventPriority(changeDoc) {
    // Priority scoring based on operation type and document characteristics
    const basePriority = {
      'insert': 10,
      'update': 5,
      'delete': 15,
      'replace': 8
    };

    let priority = basePriority[changeDoc.operationType] || 1;

    // Boost priority for high-value orders
    if (changeDoc.fullDocument?.total_amount > 1000) {
      priority += 5;
    }

    // Boost priority for status changes
    if (changeDoc.updateDescription?.updatedFields?.status) {
      priority += 3;
    }

    return priority;
  }

  generateCorrelationId(changeDoc) {
    return `${changeDoc.operationType}-${changeDoc.documentKey._id}-${Date.now()}`;
  }

  generateTraceId() {
    return require('crypto').randomUUID();
  }

  updateStreamStats(streamName, event) {
    const streamData = this.changeStreams.get(streamName);
    if (streamData) {
      streamData.stats.lastEventTime = new Date();

      switch (event) {
        case 'event_received':
          streamData.stats.eventsProcessed++;
          break;
        case 'error':
        case 'stream_error':
          streamData.stats.errors++;
          break;
      }
    }
  }

  updateProcessingMetrics(streamName, processingTime, success) {
    if (!this.processingStats.has(streamName)) {
      this.processingStats.set(streamName, {
        totalProcessed: 0,
        totalErrors: 0,
        totalProcessingTime: 0,
        avgProcessingTime: 0
      });
    }

    const stats = this.processingStats.get(streamName);

    if (success) {
      stats.totalProcessed++;
      stats.totalProcessingTime += processingTime;
      stats.avgProcessingTime = stats.totalProcessingTime / stats.totalProcessed;
    } else {
      stats.totalErrors++;
    }
  }

  getStreamStatus() {
    const status = {
      activeStreams: this.changeStreams.size,
      totalEventsProcessed: 0,
      totalErrors: 0,
      streams: {}
    };

    for (const [streamName, streamData] of this.changeStreams) {
      status.totalEventsProcessed += streamData.stats.eventsProcessed;
      status.totalErrors += streamData.stats.errors;

      status.streams[streamName] = {
        collection: streamData.collection,
        eventsProcessed: streamData.stats.eventsProcessed,
        errors: streamData.stats.errors,
        uptime: Date.now() - streamData.stats.startTime.getTime(),
        lastEventTime: streamData.stats.lastEventTime
      };
    }

    return status;
  }

  async resumeChangeStream(streamName) {
    const streamData = this.changeStreams.get(streamName);
    if (!streamData) return;

    console.log(`Resuming change stream: ${streamName}`);

    try {
      // Close current stream
      await streamData.stream.close();

      // Create new stream with resume token
      const resumeToken = this.resumeTokens.get(streamName);
      const config = {
        ...streamData.config,
        resumeToken: resumeToken
      };

      await this.createChangeStream(streamName, config);
      console.log(`Change stream ${streamName} resumed successfully`);

    } catch (error) {
      console.error(`Failed to resume change stream ${streamName}:`, error);
      this.emit('resume_failed', { streamName, error });
    }
  }

  async close() {
    console.log('Closing all change streams...');

    for (const [streamName, streamData] of this.changeStreams) {
      try {
        await streamData.stream.close();
        console.log(`Closed change stream: ${streamName}`);
      } catch (error) {
        console.error(`Error closing stream ${streamName}:`, error);
      }
    }

    this.changeStreams.clear();
    this.resumeTokens.clear();
    console.log('All change streams closed');
  }
}

// Example usage: Complete microservices coordination system
async function setupEcommerceEventProcessing() {
  console.log('Setting up comprehensive e-commerce event processing system...');

  const changeStreamManager = new MongoChangeStreamManager(db);

  // Configure change streams for different aspects of the system
  const streamConfigurations = {
    // Order lifecycle management
    orderEvents: {
      collection: 'orders',
      operationTypes: ['insert', 'update', 'delete'],
      processor: changeStreamManager.eventProcessors.get('orderLifecycle'),
      options: {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      }
    },

    // Inventory synchronization
    inventorySync: {
      collection: 'orders',
      operationTypes: ['insert', 'update'],
      documentFilter: {
        $or: [
          { operationType: 'insert' },
          { 'updateDescription.updatedFields.status': { $exists: true } }
        ]
      },
      processor: changeStreamManager.eventProcessors.get('inventorySync')
    },

    // Real-time analytics
    analyticsEvents: {
      collection: 'orders',
      processor: changeStreamManager.eventProcessors.get('realTimeAnalytics'),
      options: {
        fullDocument: 'updateLookup'
      }
    },

    // Customer engagement
    customerEngagement: {
      collection: 'orders',
      operationTypes: ['insert', 'update'],
      processor: changeStreamManager.eventProcessors.get('customerEngagement'),
      options: {
        fullDocument: 'updateLookup'
      }
    },

    // User profile updates
    userProfileSync: {
      collection: 'users',
      operationTypes: ['update'],
      documentFilter: {
        $or: [
          { 'updateDescription.updatedFields.email': { $exists: true } },
          { 'updateDescription.updatedFields.profile': { $exists: true } },
          { 'updateDescription.updatedFields.preferences': { $exists: true } }
        ]
      },
      processor: async (context) => {
        console.log(`User profile updated: ${context.getDocumentId()}`);
        // Sync profile changes across microservices
        await changeStreamManager.notifyCustomerService({
          action: 'profile_sync',
          userId: context.getDocumentId(),
          changes: context.updateDescription.updatedFields
        });
      }
    }
  };

  // Initialize all change streams
  await changeStreamManager.initializeChangeStreams(streamConfigurations);

  // Monitor system health
  setInterval(() => {
    const status = changeStreamManager.getStreamStatus();
    console.log('Change Stream System Status:', JSON.stringify(status, null, 2));
  }, 30000); // Every 30 seconds

  return changeStreamManager;
}

// Benefits of MongoDB Change Streams:
// - Real-time, ordered change events with reliable, resumable delivery
// - Resume capability from a stored resume token (while it remains within the oplog window)
// - Rich filtering and transformation capabilities through aggregation pipelines
// - Automatic failover and reconnection handling
// - Document-level granularity with full document context
// - Cluster-wide change tracking across replica sets and sharded clusters
// - Built-in support for microservices coordination patterns
// - Efficient resource utilization without polling overhead
// - Comprehensive event metadata and processing context
// - SQL-compatible change processing through QueryLeaf integration

module.exports = {
  MongoChangeStreamManager,
  setupEcommerceEventProcessing
};

Understanding MongoDB Change Streams Architecture
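
Change streams can be opened at three scopes: a single collection, a whole database, or the entire deployment. They require a replica set or sharded cluster because change events are derived from the oplog, and every event shares the same basic structure. A brief sketch, reusing the client and db handles from the example above:

// Change streams can be opened at collection, database, or deployment scope
const orderStream = db.collection('orders').watch();  // a single collection
const databaseStream = db.watch();                     // every collection in the database
const clusterStream = client.watch();                  // every database in the deployment

// Every emitted event follows the same structure regardless of scope:
// {
//   _id: <resume token>,
//   operationType: 'insert' | 'update' | 'replace' | 'delete' | ...,
//   clusterTime: <timestamp>,
//   ns: { db: 'ecommerce_platform', coll: 'orders' },
//   documentKey: { _id: <document id> },
//   fullDocument: { ... },                               // inserts, and updates with fullDocument: 'updateLookup'
//   updateDescription: { updatedFields, removedFields }  // updates only
// }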

Advanced Event-Driven Patterns and Microservices Coordination

Implement sophisticated change stream patterns for production-scale event processing:

// Production-grade change stream patterns for enterprise applications
class EnterpriseChangeStreamManager extends MongoChangeStreamManager {
  constructor(db, enterpriseConfig) {
    super(db);

    this.enterpriseConfig = {
      messageQueue: enterpriseConfig.messageQueue, // RabbitMQ, Kafka, etc.
      distributedTracing: enterpriseConfig.distributedTracing,
      metricsCollector: enterpriseConfig.metricsCollector,
      errorReporting: enterpriseConfig.errorReporting,
      circuitBreaker: enterpriseConfig.circuitBreaker
    };

    this.setupEnterpriseIntegrations();
  }

  async setupMultiTenantChangeStreams(tenantConfigurations) {
    console.log('Setting up multi-tenant change stream architecture...');

    const tenantStreams = new Map();

    for (const [tenantId, config] of Object.entries(tenantConfigurations)) {
      const tenantStreamConfig = {
        ...config,
        pipeline: [
          { $match: { 'fullDocument.tenant_id': tenantId } },
          ...(config.pipeline || [])
        ],
        processor: this.createTenantProcessor(tenantId, config.processor)
      };

      const streamName = `tenant_${tenantId}_${config.name}`;
      tenantStreams.set(streamName, tenantStreamConfig);
    }

    await this.initializeChangeStreams(Object.fromEntries(tenantStreams));
    return tenantStreams;
  }

  createTenantProcessor(tenantId, baseProcessor) {
    return async (context) => {
      // Add tenant context
      const tenantContext = {
        ...context,
        tenantId: tenantId,
        tenantConfig: await this.getTenantConfig(tenantId)
      };

      // Execute with tenant-specific error handling
      try {
        await baseProcessor(tenantContext);
      } catch (error) {
        await this.handleTenantError(tenantId, error, context);
      }
    };
  }

  async implementEventSourcingPattern(aggregateConfigs) {
    console.log('Implementing event sourcing pattern with change streams...');

    const eventSourcingStreams = {};

    for (const [aggregateName, config] of Object.entries(aggregateConfigs)) {
      eventSourcingStreams[`${aggregateName}_events`] = {
        collection: config.collection,
        operationTypes: ['insert', 'update', 'delete'],
        processor: async (context) => {
          const event = await this.buildDomainEvent(aggregateName, context);

          // Store in event store
          await this.appendToEventStore(event);

          // Update projections
          await this.updateProjections(aggregateName, event);

          // Publish to event bus
          await this.publishDomainEvent(event);
        },
        options: {
          fullDocument: 'updateLookup',
          fullDocumentBeforeChange: 'whenAvailable'
        }
      };
    }

    return eventSourcingStreams;
  }

  async buildDomainEvent(aggregateName, context) {
    const { operation, document, documentKey, updateDescription, timestamp } = context;

    return {
      eventId: require('crypto').randomUUID(),
      eventType: `${aggregateName}.${operation}`,
      aggregateId: documentKey._id,
      aggregateType: aggregateName,
      eventData: {
        before: context.fullDocumentBeforeChange,
        after: document,
        changes: updateDescription
      },
      eventMetadata: {
        timestamp: timestamp,
        causationId: context.metadata.correlationId,
        correlationId: context.metadata.traceId,
        userId: document?.user_id || 'system',
        version: await this.getAggregateVersion(aggregateName, documentKey._id)
      }
    };
  }

  async setupCQRSIntegration(cqrsConfig) {
    console.log('Setting up CQRS integration with change streams...');

    const cqrsStreams = {};

    // Command side - write model changes
    for (const [commandModel, config] of Object.entries(cqrsConfig.commandModels)) {
      cqrsStreams[`${commandModel}_commands`] = {
        collection: config.collection,
        processor: async (context) => {
          // Update read models
          await this.updateReadModels(commandModel, context);

          // Invalidate caches
          await this.invalidateReadModelCaches(commandModel, context.getDocumentId());

          // Publish integration events
          await this.publishIntegrationEvents(commandModel, context);
        }
      };
    }

    return cqrsStreams;
  }

  async setupDistributedSagaCoordination(sagaConfigurations) {
    console.log('Setting up distributed saga coordination...');

    const sagaStreams = {};

    for (const [sagaName, config] of Object.entries(sagaConfigurations)) {
      sagaStreams[`${sagaName}_saga`] = {
        collection: config.triggerCollection,
        documentFilter: config.triggerFilter,
        processor: async (context) => {
          const sagaInstance = await this.createSagaInstance(sagaName, context);
          await this.executeSagaStep(sagaInstance, context);
        }
      };
    }

    return sagaStreams;
  }

  async createSagaInstance(sagaName, triggerContext) {
    const sagaInstance = {
      sagaId: require('crypto').randomUUID(),
      sagaType: sagaName,
      status: 'started',
      currentStep: 0,
      triggerEvent: {
        aggregateId: triggerContext.getDocumentId(),
        eventData: triggerContext.document
      },
      compensation: [],
      createdAt: new Date()
    };

    await this.db.collection('saga_instances').insertOne(sagaInstance);
    return sagaInstance;
  }

  async setupAdvancedMonitoring() {
    console.log('Setting up advanced change stream monitoring...');

    const monitoringConfig = {
      healthChecks: {
        streamLiveness: true,
        processingLatency: true,
        errorRates: true,
        throughput: true
      },

      alerting: {
        streamFailure: { threshold: 1, window: '1m' },
        highLatency: { threshold: 5000, window: '5m' },
        errorRate: { threshold: 0.05, window: '10m' },
        lowThroughput: { threshold: 10, window: '5m' }
      },

      metrics: {
        prometheus: true,
        cloudwatch: false,
        datadog: false
      }
    };

    return this.initializeMonitoring(monitoringConfig);
  }
}

SQL-Style Change Stream Processing with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB change stream configuration and event processing:

-- QueryLeaf change stream management with SQL-familiar patterns

-- Create comprehensive change stream for order processing
CREATE CHANGE STREAM order_processing_stream ON orders
WATCH FOR (INSERT, UPDATE, DELETE)
WHERE 
  status IN ('pending', 'confirmed', 'shipped', 'delivered', 'cancelled')
  AND total_amount > 0
WITH OPTIONS (
  full_document = 'updateLookup',
  full_document_before_change = 'whenAvailable',
  batch_size = 100,
  max_await_time = 1000,
  start_at_operation_time = CURRENT_TIMESTAMP - INTERVAL '1 hour'
)
PROCESS WITH order_lifecycle_handler;

-- Advanced change stream with complex filtering and transformation
CREATE CHANGE STREAM high_value_order_stream ON orders
WATCH FOR (INSERT, UPDATE)
WHERE 
  operationType = 'insert' AND fullDocument.total_amount >= 1000
  OR (operationType = 'update' AND updateDescription.updatedFields.status EXISTS)
WITH PIPELINE (
  -- Stage 1: Additional filtering
  {
    $match: {
      $or: [
        { 
          operationType: 'insert',
          'fullDocument.customer_tier': { $in: ['gold', 'platinum'] }
        },
        {
          operationType: 'update',
          'fullDocument.total_amount': { $gte: 1000 }
        }
      ]
    }
  },

  -- Stage 2: Enrich with customer data
  {
    $lookup: {
      from: 'users',
      localField: 'fullDocument.user_id',
      foreignField: '_id',
      as: 'customer_data',
      pipeline: [
        {
          $project: {
            email: 1,
            customer_tier: 1,
            lifetime_value: 1,
            preferences: 1
          }
        }
      ]
    }
  },

  -- Stage 3: Calculate priority score
  {
    $addFields: {
      processing_priority: {
        $switch: {
          branches: [
            { 
              case: { $gte: ['$fullDocument.total_amount', 5000] }, 
              then: 'critical' 
            },
            { 
              case: { $gte: ['$fullDocument.total_amount', 2000] }, 
              then: 'high' 
            },
            { 
              case: { $gte: ['$fullDocument.total_amount', 1000] }, 
              then: 'medium' 
            }
          ],
          default: 'normal'
        }
      }
    }
  }
)
PROCESS WITH vip_order_processor;

-- Real-time analytics change stream with aggregation
CREATE MATERIALIZED CHANGE STREAM real_time_order_metrics ON orders
WATCH FOR (INSERT, UPDATE, DELETE)
WITH AGGREGATION (
  -- Group by time buckets for real-time metrics
  GROUP BY (
    DATE_TRUNC('minute', clusterTime, 5) as time_bucket,
    fullDocument.region as region
  )
  SELECT 
    time_bucket,
    region,

    -- Real-time KPIs
    COUNT(*) FILTER (WHERE operationType = 'insert') as new_orders,
    COUNT(*) FILTER (WHERE operationType = 'update' AND updateDescription.updatedFields.status = 'shipped') as orders_shipped,
    COUNT(*) FILTER (WHERE operationType = 'delete') as orders_cancelled,

    -- Revenue metrics
    SUM(fullDocument.total_amount) FILTER (WHERE operationType = 'insert') as new_revenue,
    AVG(fullDocument.total_amount) FILTER (WHERE operationType = 'insert') as avg_order_value,

    -- Customer metrics
    COUNT(DISTINCT fullDocument.user_id) as unique_customers,

    -- Performance indicators
    COUNT(*) / 5.0 as events_per_minute,
    CURRENT_TIMESTAMP as computed_at

  WINDOW (
    ORDER BY time_bucket
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  )
  ADD (
    AVG(new_orders) OVER window as rolling_avg_orders,
    AVG(new_revenue) OVER window as rolling_avg_revenue,

    -- Trend detection
    CASE 
      WHEN new_orders > rolling_avg_orders * 1.2 THEN 'surge'
      WHEN new_orders < rolling_avg_orders * 0.8 THEN 'decline'
      ELSE 'stable'
    END as order_trend
  )
)
REFRESH EVERY 5 SECONDS
PROCESS WITH analytics_event_handler;

-- Customer segmentation change stream with RFM analysis
CREATE CHANGE STREAM customer_behavior_analysis ON orders
WATCH FOR (INSERT, UPDATE)
WHERE fullDocument.status IN ('completed', 'delivered')
WITH CUSTOMER_SEGMENTATION (
  -- Calculate RFM metrics from change events
  SELECT 
    fullDocument.user_id as customer_id,

    -- Recency calculation
    EXTRACT(DAYS FROM CURRENT_TIMESTAMP - MAX(fullDocument.order_date)) as recency_days,

    -- Frequency calculation  
    COUNT(*) FILTER (WHERE operationType = 'insert') as order_frequency,

    -- Monetary calculation
    SUM(fullDocument.total_amount) as total_monetary_value,
    AVG(fullDocument.total_amount) as avg_order_value,

    -- Advanced behavior metrics
    COUNT(DISTINCT fullDocument.product_categories) as category_diversity,
    AVG(ARRAY_LENGTH(fullDocument.items)) as avg_items_per_order,

    -- Engagement patterns
    COUNT(*) FILTER (WHERE EXTRACT(DOW FROM fullDocument.order_date) IN (0, 6)) / COUNT(*)::float as weekend_preference,

    -- RFM scoring
    NTILE(5) OVER (ORDER BY recency_days DESC) as recency_score,
    NTILE(5) OVER (ORDER BY order_frequency ASC) as frequency_score,  
    NTILE(5) OVER (ORDER BY total_monetary_value ASC) as monetary_score,

    -- Customer segment classification
    CASE 
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) >= 4 
           AND NTILE(5) OVER (ORDER BY order_frequency ASC) >= 4 
           AND NTILE(5) OVER (ORDER BY total_monetary_value ASC) >= 4 THEN 'champions'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) >= 3 
           AND NTILE(5) OVER (ORDER BY order_frequency ASC) >= 3 
           AND NTILE(5) OVER (ORDER BY total_monetary_value ASC) >= 3 THEN 'loyal_customers'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) >= 4 
           AND NTILE(5) OVER (ORDER BY order_frequency ASC) <= 2 THEN 'potential_loyalists'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) >= 4 
           AND NTILE(5) OVER (ORDER BY order_frequency ASC) <= 1 THEN 'new_customers'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) <= 2 
           AND NTILE(5) OVER (ORDER BY order_frequency ASC) >= 3 THEN 'at_risk'
      ELSE 'needs_attention'
    END as customer_segment,

    -- Predictive metrics
    total_monetary_value / GREATEST(recency_days / 30.0, 1) * order_frequency as predicted_clv,

    CURRENT_TIMESTAMP as analyzed_at

  GROUP BY fullDocument.user_id
  WINDOW customer_analysis AS (
    PARTITION BY fullDocument.user_id
    ORDER BY fullDocument.order_date
    RANGE BETWEEN INTERVAL '365 days' PRECEDING AND CURRENT ROW
  )
)
PROCESS WITH customer_segmentation_handler;

-- Inventory synchronization change stream
CREATE CHANGE STREAM inventory_sync_stream ON orders  
WATCH FOR (INSERT, UPDATE, DELETE)
WHERE 
  operationType = 'insert' 
  OR (operationType = 'update' AND updateDescription.updatedFields.status EXISTS)
  OR operationType = 'delete'
WITH EVENT_PROCESSING (
  CASE operationType
    WHEN 'insert' THEN 
      CALL reserve_inventory(fullDocument.items, fullDocument._id)
    WHEN 'update' THEN
      CASE updateDescription.updatedFields.status
        WHEN 'cancelled' THEN 
          CALL release_inventory_reservation(fullDocument._id)
        WHEN 'shipped' THEN 
          CALL confirm_inventory_consumption(fullDocument._id)
        WHEN 'returned' THEN 
          CALL restore_inventory(fullDocument.items, fullDocument._id)
      END
    WHEN 'delete' THEN
      CALL cleanup_inventory_reservations(documentKey._id)
  END
)
WITH OPTIONS (
  retry_policy = {
    max_attempts: 3,
    backoff_strategy: 'exponential',
    base_delay: '1 second'
  },
  dead_letter_queue = 'inventory_sync_dlq',
  processing_timeout = '30 seconds'
)
PROCESS WITH inventory_coordination_handler;

-- Microservices event coordination with saga pattern
CREATE DISTRIBUTED SAGA order_fulfillment_saga 
TRIGGERED BY orders.insert
WHERE fullDocument.status = 'pending' AND fullDocument.total_amount > 0
WITH STEPS (
  -- Step 1: Payment processing
  {
    service: 'payment-service',
    action: 'process_payment',
    input: {
      order_id: NEW.documentKey._id,
      amount: NEW.fullDocument.total_amount,
      payment_method: NEW.fullDocument.payment_method
    },
    compensation: {
      service: 'payment-service', 
      action: 'refund_payment',
      input: { payment_id: '${payment_result.payment_id}' }
    },
    timeout: '30 seconds'
  },

  -- Step 2: Inventory reservation
  {
    service: 'inventory-service',
    action: 'reserve_products',
    input: {
      order_id: NEW.documentKey._id,
      items: NEW.fullDocument.items
    },
    compensation: {
      service: 'inventory-service',
      action: 'release_reservation', 
      input: { reservation_id: '${inventory_result.reservation_id}' }
    },
    timeout: '15 seconds'
  },

  -- Step 3: Shipping calculation
  {
    service: 'shipping-service',
    action: 'calculate_shipping',
    input: {
      order_id: NEW.documentKey._id,
      shipping_address: NEW.fullDocument.shipping_address,
      items: NEW.fullDocument.items
    },
    compensation: {
      service: 'shipping-service',
      action: 'cancel_shipping',
      input: { shipping_id: '${shipping_result.shipping_id}' }
    },
    timeout: '10 seconds'
  },

  -- Step 4: Order confirmation
  {
    service: 'notification-service',
    action: 'send_confirmation',
    input: {
      order_id: NEW.documentKey._id,
      customer_email: NEW.fullDocument.customer_email,
      order_details: NEW.fullDocument
    },
    timeout: '5 seconds'
  }
)
WITH SAGA_OPTIONS (
  max_retry_attempts = 3,
  compensation_timeout = '60 seconds',
  saga_timeout = '5 minutes'
);

-- Event sourcing pattern with change streams
CREATE EVENT STORE order_events
FROM CHANGE STREAM orders.*
WITH EVENT_MAPPING (
  event_type = CONCAT('Order.', TITLE_CASE(operationType)),
  aggregate_id = documentKey._id,
  aggregate_type = 'Order',
  event_data = {
    before: fullDocumentBeforeChange,
    after: fullDocument,
    changes: updateDescription
  },
  event_metadata = {
    timestamp: clusterTime,
    causation_id: correlation_id,
    correlation_id: trace_id,
    user_id: COALESCE(fullDocument.user_id, 'system'),
    version: aggregate_version + 1
  }
)
WITH PROJECTIONS (
  -- Order summary projection
  order_summary = {
    aggregate_id: aggregate_id,
    current_status: event_data.after.status,
    total_amount: event_data.after.total_amount,
    created_at: event_data.after.created_at,
    last_updated: event_metadata.timestamp,
    version: event_metadata.version
  },

  -- Customer order history projection  
  customer_orders = {
    customer_id: event_data.after.user_id,
    order_id: aggregate_id,
    order_amount: event_data.after.total_amount,
    order_date: event_data.after.created_at,
    status: event_data.after.status
  }
);

-- Advanced monitoring and alerting for change streams
CREATE CHANGE STREAM MONITOR comprehensive_monitoring
WITH METRICS (
  -- Stream health metrics
  stream_uptime,
  events_processed_per_second,
  processing_latency_p95,
  error_rate,
  resume_token_age,

  -- Business metrics
  high_value_orders_per_minute,
  average_processing_time,
  failed_event_count,

  -- System resource metrics
  memory_usage,
  cpu_utilization,
  network_throughput
)
WITH ALERTS (
  -- Critical alerts
  stream_disconnected = {
    condition: stream_uptime = 0,
    severity: 'critical',
    notification: ['pager', 'slack:#ops-critical']
  },

  high_error_rate = {
    condition: error_rate > 0.05 FOR 5 MINUTES,
    severity: 'high', 
    notification: ['email:ops-team@company.com', 'slack:#database-alerts']
  },

  processing_latency = {
    condition: processing_latency_p95 > 5000 FOR 3 MINUTES,
    severity: 'medium',
    notification: ['slack:#performance-alerts']
  },

  -- Business alerts
  revenue_drop = {
    condition: high_value_orders_per_minute < 10 FOR 10 MINUTES DURING BUSINESS_HOURS,
    severity: 'high',
    notification: ['email:business-ops@company.com']
  }
);

-- QueryLeaf provides comprehensive change stream capabilities:
-- 1. SQL-familiar syntax for MongoDB change stream creation and management
-- 2. Advanced filtering and transformation through aggregation pipelines
-- 3. Real-time analytics and materialized views from change events
-- 4. Customer segmentation and behavioral analysis integration
-- 5. Microservices coordination with distributed saga patterns
-- 6. Event sourcing and CQRS implementation support
-- 7. Comprehensive monitoring and alerting for production environments
-- 8. Inventory synchronization and business process automation
-- 9. Multi-tenant and enterprise-grade change stream management
-- 10. Integration with external message queues and event systems

Best Practices for Change Stream Implementation

Event-Driven Architecture Design

Essential principles for building robust change stream-based systems:

  1. Resume Token Management: Always store resume tokens for fault tolerance and recovery (see the sketch after this list)
  2. Event Processing Idempotency: Design event processors to handle duplicate events gracefully
  3. Error Handling Strategy: Implement comprehensive error handling with retry policies and dead letter queues
  4. Filtering Optimization: Use early filtering in change stream pipelines to reduce processing overhead
  5. Resource Management: Monitor and manage memory usage for long-running change streams
  6. Monitoring Integration: Implement comprehensive monitoring for stream health and processing metrics
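
A minimal sketch of the first two principles, assuming an illustrative checkpoints collection for resume tokens and a processed_events collection with a unique index for deduplication; handleChange stands in for the application-specific processor:

// Sketch: resume token persistence plus idempotent processing (illustrative collection names)
async function watchWithCheckpoint(db, streamName) {
  const checkpoints = db.collection('checkpoints');
  const processed = db.collection('processed_events');
  await processed.createIndex({ eventId: 1 }, { unique: true }); // guards against concurrent duplicates

  // Resume from the last stored token if one exists
  const checkpoint = await checkpoints.findOne({ _id: streamName });
  const options = { fullDocument: 'updateLookup' };
  if (checkpoint?.resumeToken) options.resumeAfter = checkpoint.resumeToken;

  const stream = db.collection('orders').watch([], options);

  for await (const change of stream) {
    const eventId = JSON.stringify(change._id); // the resume token doubles as a stable event id

    // Idempotency: skip events that were already handled on a previous run
    const alreadyProcessed = await processed.findOne({ eventId });
    if (!alreadyProcessed) {
      await handleChange(change); // application-specific processing
      await processed.insertOne({ eventId, processedAt: new Date() });
    }

    // Persist the resume token only after the event has been handled
    await checkpoints.updateOne(
      { _id: streamName },
      { $set: { resumeToken: change._id, updatedAt: new Date() } },
      { upsert: true }
    );
  }
}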

Production Deployment Strategies

Optimize change stream deployments for production-scale environments:

  1. High Availability: Deploy change stream processors across multiple instances with proper load balancing
  2. Scaling Patterns: Implement horizontal scaling strategies for high-throughput scenarios
  3. Performance Monitoring: Track processing latency, throughput, and error rates continuously (a simple health endpoint is sketched after this list)
  4. Security Considerations: Ensure proper authentication and authorization for change stream access
  5. Backup and Recovery: Implement comprehensive backup strategies for resume tokens and processing state
  6. Integration Testing: Thoroughly test change stream integrations with downstream systems
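
As one way to make these metrics visible, the sketch below exposes the getStreamStatus() output of the MongoChangeStreamManager shown earlier through a plain HTTP health endpoint; the port, path, and health threshold are arbitrary choices:

// Sketch: expose change stream health for external monitoring (port, path, and threshold are arbitrary)
const http = require('http');

function startHealthEndpoint(changeStreamManager, port = 9100) {
  const server = http.createServer((req, res) => {
    if (req.url === '/health/change-streams') {
      const status = changeStreamManager.getStreamStatus();
      const healthy = status.activeStreams > 0 && status.totalErrors < 100; // example threshold

      res.writeHead(healthy ? 200 : 503, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(status));
    } else {
      res.writeHead(404);
      res.end();
    }
  });

  server.listen(port, () => console.log(`Change stream health endpoint listening on :${port}`));
  return server;
}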

Conclusion

MongoDB Change Streams provide a powerful foundation for building sophisticated event-driven architectures that enable real-time data processing, microservices coordination, and reactive application patterns. The ordered, resumable stream of change events eliminates the complexity and limitations of traditional change detection approaches while providing comprehensive filtering, transformation, and integration capabilities.

Key MongoDB Change Streams benefits include:

  • Real-time Processing: Immediate notification of data changes without polling overhead
  • Fault Tolerance: Resume capability from a stored resume token, so no changes are missed while the token remains within the oplog window
  • Rich Context: Complete document context with before/after states for comprehensive processing
  • Scalable Architecture: Horizontal scaling support for high-throughput event processing scenarios
  • Microservices Integration: Native support for distributed system coordination and communication patterns
  • Flexible Filtering: Advanced aggregation pipeline integration for sophisticated event filtering and transformation

Whether you're building real-time analytics platforms, microservices architectures, event sourcing systems, or reactive applications, MongoDB Change Streams with QueryLeaf's familiar SQL interface provide the foundation for modern event-driven development.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB change stream operations while providing SQL-familiar syntax for event processing, microservices coordination, and real-time analytics. Advanced change stream patterns, saga orchestration, and event sourcing capabilities are seamlessly accessible through familiar SQL constructs, making sophisticated event-driven architectures both powerful and approachable for SQL-oriented development teams.

The combination of MongoDB's robust change stream capabilities with SQL-style operations makes it an ideal platform for modern applications requiring real-time responsiveness and distributed system coordination, ensuring your event-driven architectures can scale efficiently while maintaining consistency and reliability across complex distributed topologies.

MongoDB Indexing Strategies and Compound Indexes: Advanced Performance Optimization for Scalable Database Operations

Database performance at scale depends heavily on effective indexing strategies that can efficiently support diverse query patterns while minimizing storage overhead and maintenance costs. Poor indexing decisions lead to slow query performance, excessive resource consumption, and degraded user experience that becomes increasingly problematic as data volumes and application complexity grow.

MongoDB's indexing system provides comprehensive support for single-field and compound indexes, partial indexes, text search indexes, and specialized data-type indexes that let developers optimize query performance for complex application requirements. Unlike traditional database systems with rigid indexing constraints, MongoDB's flexible indexing architecture supports dynamic schema requirements while providing powerful optimization capabilities through compound indexes, index intersection, and advanced filtering strategies.
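
Before examining how relational systems struggle with these requirements, the basic MongoDB index shapes referenced throughout this article are worth seeing in isolation. The following is a minimal sketch with illustrative collection and field names; it is not tied to any specific schema.

// Representative MongoDB index types (illustrative field names)
const { MongoClient } = require('mongodb');

async function createExampleIndexes(uri) {
  const client = new MongoClient(uri);
  await client.connect();
  const users = client.db('app').collection('users');

  // Compound index following the ESR (Equality, Sort, Range) guideline
  await users.createIndex({ status: 1, country: 1, last_login_at: -1, created_at: 1 });

  // Partial index: only documents matching the filter expression are indexed
  await users.createIndex(
    { subscription_tier: 1 },
    { partialFilterExpression: { subscription_tier: { $exists: true } } }
  );

  // Text index for word-based search across name fields
  await users.createIndex({ first_name: 'text', last_name: 'text' });

  // TTL index: automatically remove session documents after one day
  const sessions = client.db('app').collection('sessions');
  await sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 86400 });
}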

The Limitations of Traditional Database Indexing

Conventional database indexing approaches often struggle with complex query patterns and multi-dimensional data access requirements:

-- Traditional PostgreSQL indexing with limited flexibility and optimization challenges

-- Basic single-column indexes with poor compound query support
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX idx_users_status ON users (status);
CREATE INDEX idx_users_created_at ON users (created_at);
CREATE INDEX idx_users_last_login ON users (last_login_at);
CREATE INDEX idx_users_country ON users (country);

-- Simple compound index with fixed column order limitations
CREATE INDEX idx_users_status_country ON users (status, country);

-- Complex query requiring multiple index scans and poor optimization
SELECT 
  u.user_id,
  u.email,
  u.first_name,
  u.last_name,
  u.status,
  u.country,
  u.created_at,
  u.last_login_at,
  COUNT(o.order_id) as order_count,
  SUM(o.total_amount) as total_spent,
  MAX(o.order_date) as last_order_date
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE u.status IN ('active', 'premium', 'trial')
  AND u.country IN ('US', 'CA', 'UK', 'AU', 'DE', 'FR')
  AND u.created_at >= CURRENT_DATE - INTERVAL '2 years'
  AND u.last_login_at >= CURRENT_DATE - INTERVAL '30 days'
  AND (u.email LIKE '%@gmail.com' OR u.email LIKE '%@hotmail.com')
  AND u.subscription_tier IS NOT NULL
GROUP BY u.user_id, u.email, u.first_name, u.last_name, u.status, u.country, u.created_at, u.last_login_at
HAVING COUNT(o.order_id) > 0
ORDER BY total_spent DESC, last_order_date DESC
LIMIT 100;

-- PostgreSQL EXPLAIN showing inefficient index usage:
-- 
-- Limit  (cost=45234.67..45234.92 rows=100 width=128) (actual time=1247.123..1247.189 rows=100 loops=1)
--   ->  Sort  (cost=45234.67..45789.23 rows=221824 width=128) (actual time=1247.121..1247.156 rows=100 loops=1)
--         Sort Key: (sum(o.total_amount)) DESC, (max(o.order_date)) DESC
--         Sort Method: top-N heapsort  Memory: 67kB
--         ->  HashAggregate  (cost=38234.56..40456.80 rows=221824 width=128) (actual time=1156.789..1201.234 rows=12789 loops=1)
--               Group Key: u.user_id, u.email, u.first_name, u.last_name, u.status, u.country, u.created_at, u.last_login_at
--               ->  Hash Left Join  (cost=12345.67..32890.45 rows=221824 width=96) (actual time=89.456..567.123 rows=87645 loops=1)
--                     Hash Cond: (u.user_id = o.user_id)
--                     ->  Bitmap Heap Scan on users u  (cost=3456.78..8901.23 rows=45678 width=88) (actual time=34.567..123.456 rows=23456 loops=1)
--                           Recheck Cond: ((status = ANY ('{active,premium,trial}'::text[])) AND 
--                                         (country = ANY ('{US,CA,UK,AU,DE,FR}'::text[])) AND 
--                                         (created_at >= (CURRENT_DATE - '2 years'::interval)) AND 
--                                         (last_login_at >= (CURRENT_DATE - '30 days'::interval)))
--                           Filter: ((subscription_tier IS NOT NULL) AND 
--                                   ((email ~~ '%@gmail.com'::text) OR (email ~~ '%@hotmail.com'::text)))
--                           Rows Removed by Filter: 12789
--                           Heap Blocks: exact=1234 lossy=234
--                           ->  BitmapOr  (cost=3456.78..3456.78 rows=45678 width=0) (actual time=33.890..33.891 rows=0 loops=1)
--                                 ->  Bitmap Index Scan on idx_users_status_country  (cost=0.00..1234.56 rows=15678 width=0) (actual time=12.345..12.345 rows=18901 loops=1)
--                                       Index Cond: ((status = ANY ('{active,premium,trial}'::text[])) AND 
--                                                   (country = ANY ('{US,CA,UK,AU,DE,FR}'::text[])))
--                                 ->  Bitmap Index Scan on idx_users_created_at  (cost=0.00..1890.23 rows=25678 width=0) (actual time=18.234..18.234 rows=34567 loops=1)
--                                       Index Cond: (created_at >= (CURRENT_DATE - '2 years'::interval))
--                                 ->  Bitmap Index Scan on idx_users_last_login  (cost=0.00..331.99 rows=4322 width=0) (actual time=3.311..3.311 rows=8765 loops=1)
--                                       Index Cond: (last_login_at >= (CURRENT_DATE - '30 days'::interval))
--                     ->  Hash  (cost=7890.45..7890.45 rows=234567 width=24) (actual time=54.889..54.889 rows=198765 loops=1)
--                           Buckets: 262144  Batches: 1  Memory Usage: 11234kB
--                           ->  Seq Scan on orders o  (cost=0.00..7890.45 rows=234567 width=24) (actual time=0.234..28.901 rows=198765 loops=1)
-- Planning Time: 4.567 ms
-- Execution Time: 1247.567 ms

-- Problems with traditional PostgreSQL indexing:
-- 1. Multiple bitmap index scans required due to lack of comprehensive compound index
-- 2. Expensive BitmapOr operations combining multiple index results
-- 3. Large number of rows removed by filter conditions not supported by indexes
-- 4. Complex compound indexes difficult to design for multiple query patterns
-- 5. Index bloat and maintenance overhead with many single-column indexes
-- 6. Poor support for partial indexes and conditional filtering
-- 7. Limited flexibility in query optimization and index selection
-- 8. Difficulty optimizing for mixed equality/range/pattern matching conditions

-- Attempt to create better compound index
CREATE INDEX idx_users_comprehensive ON users (
  status, country, created_at, last_login_at, subscription_tier, email
);

-- Problems with large compound indexes:
-- 1. Index becomes very large and expensive to maintain
-- 2. Only efficient for queries that follow exact prefix patterns
-- 3. Wasted space for queries that don't use all index columns
-- 4. Update performance degradation due to large index maintenance
-- 5. Limited effectiveness for partial field matching (email patterns)
-- 6. Poor selectivity when early columns have low cardinality

-- MySQL limitations are even more restrictive
CREATE INDEX idx_users_limited ON users (status, country, created_at);
-- MySQL compound index limitations:
-- - Maximum 16 columns per compound index
-- - 767-byte index key prefix limit on older InnoDB row formats (3072 bytes with DYNAMIC/COMPRESSED)
-- - Poor optimization for range queries on non-leading columns
-- - Limited partial index support
-- - Inefficient covering index implementation
-- - Basic query optimizer with limited compound index utilization

-- Alternative approach with covering indexes (PostgreSQL)
CREATE INDEX idx_users_covering ON users (status, country, created_at) 
INCLUDE (email, first_name, last_name, last_login_at, subscription_tier);

-- Covering index problems:
-- 1. Large storage overhead for included columns
-- 2. Still limited by leading column selectivity
-- 3. Expensive maintenance operations
-- 4. Complex index design decisions
-- 5. Poor performance for non-matching query patterns

MongoDB provides sophisticated compound indexing with flexible optimization:

// MongoDB Advanced Indexing Strategies - comprehensive compound index management and optimization
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('advanced_ecommerce_platform');

// Advanced MongoDB indexing strategy and compound index optimization system
class MongoIndexOptimizer {
  constructor(db) {
    this.db = db;
    this.collections = {
      users: db.collection('users'),
      orders: db.collection('orders'),
      products: db.collection('products'),
      analytics: db.collection('analytics'),
      sessions: db.collection('sessions')
    };

    // Index optimization configuration
    this.indexingStrategies = {
      equalityFirst: true,        // ESR pattern - Equality, Sort, Range
      sortOptimization: true,     // Optimize for sort operations
      partialIndexes: true,       // Use partial indexes for selective filtering
      coveringIndexes: true,      // Create covering indexes where beneficial
      textSearchIndexes: true,    // Advanced text search capabilities
      geospatialIndexes: true,    // Location-based indexing
      ttlIndexes: true           // Time-based data expiration
    };

    this.performanceTargets = {
      maxQueryTimeMs: 100,
      minIndexSelectivity: 0.1,
      maxIndexSizeMB: 500,
      maxIndexesPerCollection: 10
    };

    this.indexAnalytics = new Map();
  }

  async implementComprehensiveIndexingStrategy(collectionName, queryPatterns) {
    console.log(`Implementing comprehensive indexing strategy for ${collectionName}...`);

    const collection = this.collections[collectionName];
    const existingIndexes = await collection.listIndexes().toArray();

    const indexingPlan = {
      collection: collectionName,
      queryPatterns: queryPatterns,
      existingIndexes: existingIndexes,
      recommendedIndexes: [],
      optimizationActions: [],
      performanceProjections: {}
    };

    // Analyze query patterns for optimal index design
    const queryAnalysis = await this.analyzeQueryPatterns(queryPatterns);

    // Generate compound index recommendations
    const compoundIndexes = await this.generateCompoundIndexes(queryAnalysis);

    // Design partial indexes for selective filtering
    const partialIndexes = await this.generatePartialIndexes(queryAnalysis);

    // Create covering indexes for frequently accessed projections
    const coveringIndexes = await this.generateCoveringIndexes(queryAnalysis);

    // Specialized indexes for specific data types and operations
    const specializedIndexes = await this.generateSpecializedIndexes(queryAnalysis);

    indexingPlan.recommendedIndexes = [
      ...compoundIndexes,
      ...partialIndexes, 
      ...coveringIndexes,
      ...specializedIndexes
    ];

    // Validate index recommendations against performance targets
    const validatedPlan = await this.validateIndexingPlan(collection, indexingPlan);

    // Execute index creation with comprehensive monitoring
    const implementationResult = await this.executeIndexingPlan(collection, validatedPlan);

    // Performance validation and optimization
    const performanceValidation = await this.validateIndexPerformance(collection, validatedPlan, queryPatterns);

    return {
      plan: validatedPlan,
      implementation: implementationResult,
      performance: performanceValidation,
      summary: {
        totalIndexes: validatedPlan.recommendedIndexes.length,
        compoundIndexes: compoundIndexes.length,
        partialIndexes: partialIndexes.length,
        coveringIndexes: coveringIndexes.length,
        specializedIndexes: specializedIndexes.length,
        estimatedPerformanceImprovement: this.calculatePerformanceImprovement(validatedPlan)
      }
    };
  }

  async analyzeQueryPatterns(queryPatterns) {
    console.log(`Analyzing ${queryPatterns.length} query patterns for index optimization...`);

    const analysis = {
      fieldUsage: new Map(),           // How often each field is used
      fieldCombinations: new Map(),    // Common field combinations
      filterTypes: new Map(),          // Types of filters (equality, range, etc.)
      sortPatterns: new Map(),         // Sort field combinations
      projectionPatterns: new Map(),   // Frequently requested projections
      selectivityEstimates: new Map()  // Estimated field selectivity
    };

    for (const pattern of queryPatterns) {
      // Analyze filter conditions
      this.analyzeFilterConditions(pattern.filter || {}, analysis);

      // Analyze sort requirements
      this.analyzeSortPatterns(pattern.sort || {}, analysis);

      // Analyze projection requirements
      this.analyzeProjectionPatterns(pattern.projection || {}, analysis);

      // Track query frequency for weighting
      const frequency = pattern.frequency || 1;
      this.updateFrequencyWeights(analysis, frequency);
    }

    // Calculate field selectivity estimates
    await this.estimateFieldSelectivity(analysis);

    // Identify optimal field combinations
    const optimalCombinations = this.identifyOptimalFieldCombinations(analysis);

    return {
      ...analysis,
      optimalCombinations: optimalCombinations,
      indexingRecommendations: this.generateIndexingRecommendations(analysis, optimalCombinations)
    };
  }

  analyzeFilterConditions(filter, analysis) {
    Object.entries(filter).forEach(([field, condition]) => {
      if (field.startsWith('$')) return; // Skip operators

      // Track field usage frequency
      const currentUsage = analysis.fieldUsage.get(field) || 0;
      analysis.fieldUsage.set(field, currentUsage + 1);

      // Categorize filter types
      const filterType = this.categorizeFilterType(condition);
      const currentFilterTypes = analysis.filterTypes.get(field) || new Set();
      currentFilterTypes.add(filterType);
      analysis.filterTypes.set(field, currentFilterTypes);

      // Track field combinations for compound indexes
      const otherFields = Object.keys(filter).filter(f => f !== field && !f.startsWith('$'));
      if (otherFields.length > 0) {
        const combination = [field, ...otherFields].sort().join(',');
        const currentCombinations = analysis.fieldCombinations.get(combination) || 0;
        analysis.fieldCombinations.set(combination, currentCombinations + 1);
      }
    });
  }

  categorizeFilterType(condition) {
    if (typeof condition === 'object' && condition !== null) {
      const operators = Object.keys(condition);

      if (operators.includes('$gte') || operators.includes('$gt') || 
          operators.includes('$lte') || operators.includes('$lt')) {
        return 'range';
      } else if (operators.includes('$in')) {
        return condition.$in.length <= 10 ? 'selective_in' : 'large_in';
      } else if (operators.includes('$regex')) {
        return 'pattern_match';
      } else if (operators.includes('$exists')) {
        return 'existence';
      } else if (operators.includes('$ne')) {
        return 'negation';
      } else {
        return 'complex';
      }
    } else {
      return 'equality';
    }
  }

  analyzeSortPatterns(sort, analysis) {
    if (Object.keys(sort).length === 0) return;

    const sortKey = Object.entries(sort)
      .map(([field, direction]) => `${field}:${direction}`)
      .join(',');

    const currentSort = analysis.sortPatterns.get(sortKey) || 0;
    analysis.sortPatterns.set(sortKey, currentSort + 1);
  }

  analyzeProjectionPatterns(projection, analysis) {
    if (!projection || Object.keys(projection).length === 0) return;

    const projectedFields = Object.keys(projection).filter(field => projection[field] === 1);
    const projectionKey = projectedFields.sort().join(',');

    if (projectionKey) {
      const currentProjection = analysis.projectionPatterns.get(projectionKey) || 0;
      analysis.projectionPatterns.set(projectionKey, currentProjection + 1);
    }
  }

  async generateCompoundIndexes(analysis) {
    console.log('Generating optimal compound index recommendations...');

    const compoundIndexes = [];

    // Sort field combinations by frequency and potential impact
    const sortedCombinations = Array.from(analysis.fieldCombinations.entries())
      .sort(([, a], [, b]) => b - a)
      .slice(0, 20); // Consider top 20 combinations

    for (const [fieldCombination, frequency] of sortedCombinations) {
      const fields = fieldCombination.split(',');

      // Apply ESR (Equality, Sort, Range) pattern optimization
      const optimizedIndex = this.optimizeIndexWithESRPattern(fields, analysis);

      if (optimizedIndex && this.validateIndexUtility(optimizedIndex, analysis)) {
        compoundIndexes.push({
          type: 'compound',
          name: `idx_${optimizedIndex.fields.map(f => f.field).join('_')}`,
          specification: this.buildIndexSpecification(optimizedIndex.fields),
          options: optimizedIndex.options,
          reasoning: optimizedIndex.reasoning,
          estimatedImpact: this.estimateIndexImpact(optimizedIndex, analysis),
          queryPatterns: this.identifyMatchingQueries(optimizedIndex, analysis),
          priority: this.calculateIndexPriority(optimizedIndex, frequency, analysis)
        });
      }
    }

    // Sort by priority and return top recommendations
    return compoundIndexes
      .sort((a, b) => b.priority - a.priority)
      .slice(0, this.performanceTargets.maxIndexesPerCollection);
  }

  optimizeIndexWithESRPattern(fields, analysis) {
    console.log(`Optimizing index for fields: ${fields.join(', ')} using ESR pattern...`);

    const optimizedFields = [];
    const fieldAnalysis = new Map();

    // Analyze each field's characteristics
    fields.forEach(field => {
      const filterTypes = analysis.filterTypes.get(field) || new Set();
      const usage = analysis.fieldUsage.get(field) || 0;
      const selectivity = analysis.selectivityEstimates.get(field) || 0.5;

      fieldAnalysis.set(field, {
        filterTypes: Array.from(filterTypes),
        usage: usage,
        selectivity: selectivity,
        isEquality: filterTypes.has('equality') || filterTypes.has('selective_in'),
        isRange: filterTypes.has('range'),
        isSort: this.isFieldUsedInSort(field, analysis),
        sortDirection: this.getSortDirection(field, analysis)
      });
    });

    // Step 1: Equality fields first (highest selectivity first)
    const equalityFields = fields
      .filter(field => fieldAnalysis.get(field).isEquality)
      .sort((a, b) => fieldAnalysis.get(b).selectivity - fieldAnalysis.get(a).selectivity);

    equalityFields.forEach(field => {
      const fieldInfo = fieldAnalysis.get(field);
      optimizedFields.push({
        field: field,
        direction: 1,
        type: 'equality',
        selectivity: fieldInfo.selectivity,
        reasoning: `Equality filter with ${(fieldInfo.selectivity * 100).toFixed(1)}% selectivity`
      });
    });

    // Step 2: Sort fields (maintaining sort direction)
    const sortFields = fields
      .filter(field => fieldAnalysis.get(field).isSort && !fieldAnalysis.get(field).isEquality)
      .sort((a, b) => fieldAnalysis.get(b).usage - fieldAnalysis.get(a).usage);

    sortFields.forEach(field => {
      const fieldInfo = fieldAnalysis.get(field);
      optimizedFields.push({
        field: field,
        direction: fieldInfo.sortDirection || 1,
        type: 'sort',
        selectivity: fieldInfo.selectivity,
        reasoning: `Sort field with ${fieldInfo.usage} usage frequency`
      });
    });

    // Step 3: Range fields last (lowest selectivity impact)
    const rangeFields = fields
      .filter(field => fieldAnalysis.get(field).isRange && 
                      !fieldAnalysis.get(field).isEquality && 
                      !fieldAnalysis.get(field).isSort)
      .sort((a, b) => fieldAnalysis.get(b).selectivity - fieldAnalysis.get(a).selectivity);

    rangeFields.forEach(field => {
      const fieldInfo = fieldAnalysis.get(field);
      optimizedFields.push({
        field: field,
        direction: 1,
        type: 'range',
        selectivity: fieldInfo.selectivity,
        reasoning: `Range filter with ${(fieldInfo.selectivity * 100).toFixed(1)}% selectivity`
      });
    });

    // Validate and return optimized index
    if (optimizedFields.length === 0) return null;

    return {
      fields: optimizedFields,
      options: this.generateIndexOptions(optimizedFields, analysis),
      reasoning: `ESR-optimized compound index: ${optimizedFields.length} fields arranged for optimal query performance`,
      estimatedSelectivity: this.calculateCompoundSelectivity(optimizedFields),
      supportedQueryTypes: this.identifySupportedQueryTypes(optimizedFields, analysis)
    };
  }

  async generatePartialIndexes(analysis) {
    console.log('Generating partial index recommendations for selective filtering...');

    const partialIndexes = [];

    // Identify fields with high selectivity potential
    const selectiveFields = Array.from(analysis.selectivityEstimates.entries())
      .filter(([field, selectivity]) => selectivity < this.performanceTargets.minIndexSelectivity)
      .sort(([, a], [, b]) => a - b); // Lower selectivity first (more selective)

    for (const [field, selectivity] of selectiveFields) {
      const filterTypes = analysis.filterTypes.get(field) || new Set();
      const usage = analysis.fieldUsage.get(field) || 0;

      // Generate partial filter conditions
      const partialFilters = this.generatePartialFilterConditions(field, filterTypes, analysis);

      for (const partialFilter of partialFilters) {
        const partialIndex = {
          type: 'partial',
          name: `idx_${field}_${partialFilter.suffix}`,
          specification: { [field]: 1 },
          options: {
            partialFilterExpression: partialFilter.expression,
            background: true
          },
          reasoning: partialFilter.reasoning,
          estimatedReduction: partialFilter.estimatedReduction,
          applicableQueries: partialFilter.applicableQueries,
          priority: this.calculatePartialIndexPriority(field, usage, selectivity, partialFilter)
        };

        if (this.validatePartialIndexUtility(partialIndex, analysis)) {
          partialIndexes.push(partialIndex);
        }
      }
    }

    return partialIndexes
      .sort((a, b) => b.priority - a.priority)
      .slice(0, Math.floor(this.performanceTargets.maxIndexesPerCollection / 3));
  }

  generatePartialFilterConditions(field, filterTypes, analysis) {
    const partialFilters = [];

    // Status/category fields with selective values
    if (filterTypes.has('equality') || filterTypes.has('selective_in')) {
      partialFilters.push({
        expression: { [field]: { $in: ['active', 'premium', 'verified'] } },
        suffix: 'active_premium',
        reasoning: `Partial index for high-value ${field} categories`,
        estimatedReduction: 0.7,
        applicableQueries: [`${field} equality matches for active/premium users`]
      });
    }

    // Date fields with recency focus
    if (filterTypes.has('range') && (field.includes('date') || field.includes('time'))) {
      partialFilters.push({
        expression: { [field]: { $gte: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) } },
        suffix: 'recent_90d',
        reasoning: `Partial index for recent ${field} within 90 days`,
        estimatedReduction: 0.8,
        applicableQueries: [`Recent ${field} range queries`]
      });
    }

    // Numeric fields with value thresholds
    if (filterTypes.has('range') && (field.includes('amount') || field.includes('count') || field.includes('score'))) {
      partialFilters.push({
        expression: { [field]: { $gt: 0 } },
        suffix: 'positive_values',
        reasoning: `Partial index excluding zero/null ${field} values`,
        estimatedReduction: 0.6,
        applicableQueries: [`${field} range queries for positive values`]
      });
    }

    return partialFilters;
  }

  async generateCoveringIndexes(analysis) {
    console.log('Generating covering index recommendations for query optimization...');

    const coveringIndexes = [];

    // Analyze projection patterns to identify covering index opportunities
    const projectionAnalysis = Array.from(analysis.projectionPatterns.entries())
      .sort(([, a], [, b]) => b - a)
      .slice(0, 10); // Top 10 projection patterns

    for (const [projectionKey, frequency] of projectionAnalysis) {
      const projectedFields = projectionKey.split(',');

      // Find queries that could benefit from covering indexes
      const candidateQueries = this.identifyCoveringIndexCandidates(projectedFields, analysis);

      if (candidateQueries.length > 0) {
        const coveringIndex = this.designCoveringIndex(projectedFields, candidateQueries, analysis);

        if (coveringIndex && this.validateCoveringIndexBenefit(coveringIndex, analysis)) {
          coveringIndexes.push({
            type: 'covering',
            name: `idx_covering_${coveringIndex.keyFields.join('_')}`,
            specification: coveringIndex.specification,
            options: coveringIndex.options,
            reasoning: coveringIndex.reasoning,
            coveredQueries: candidateQueries.length,
            projectedFields: projectedFields,
            estimatedImpact: this.estimateCoveringIndexImpact(coveringIndex, frequency),
            priority: this.calculateCoveringIndexPriority(coveringIndex, frequency, candidateQueries.length)
          });
        }
      }
    }

    return coveringIndexes
      .sort((a, b) => b.priority - a.priority)
      .slice(0, Math.floor(this.performanceTargets.maxIndexesPerCollection / 4));
  }

  designCoveringIndex(projectedFields, candidateQueries, analysis) {
    // Analyze filter and sort patterns from candidate queries
    const filterFields = new Set();
    const sortFields = new Map();

    candidateQueries.forEach(query => {
      Object.keys(query.filter || {}).forEach(field => {
        if (!field.startsWith('$')) {
          filterFields.add(field);
        }
      });

      Object.entries(query.sort || {}).forEach(([field, direction]) => {
        sortFields.set(field, direction);
      });
    });

    // Design optimal key structure
    const keyFields = [];
    const includeFields = [];

    // Add filter fields to key (equality first, then range)
    const equalityFields = Array.from(filterFields).filter(field => {
      const filterTypes = analysis.filterTypes.get(field) || new Set();
      return filterTypes.has('equality') || filterTypes.has('selective_in');
    });

    const rangeFields = Array.from(filterFields).filter(field => {
      const filterTypes = analysis.filterTypes.get(field) || new Set();
      return filterTypes.has('range');
    });

    // Add equality fields to key
    equalityFields.forEach(field => {
      keyFields.push(field);
    });

    // Add sort fields to key
    sortFields.forEach((direction, field) => {
      if (!keyFields.includes(field)) {
        keyFields.push(field);
      }
    });

    // Add range fields to key
    rangeFields.forEach(field => {
      if (!keyFields.includes(field)) {
        keyFields.push(field);
      }
    });

    // Add remaining projected fields as included fields
    projectedFields.forEach(field => {
      if (!keyFields.includes(field)) {
        includeFields.push(field);
      }
    });

    if (keyFields.length === 0) return null;

    // Build index specification
    const specification = {};
    keyFields.forEach(field => {
      const direction = sortFields.get(field) || 1;
      specification[field] = direction;
    });

    // MongoDB has no INCLUDE clause; a query is covered only when every
    // projected field is part of the index keys, so append the remaining
    // projected fields as trailing keys
    includeFields.forEach(field => {
      if (!(field in specification)) {
        specification[field] = 1;
      }
    });

    return {
      keyFields: keyFields,
      includeFields: includeFields,
      specification: specification,
      options: {
        background: true
      },
      reasoning: `Covering index with ${keyFields.length} filter/sort keys and ${includeFields.length} trailing projected keys`,
      estimatedCoverage: this.calculateQueryCoverage(keyFields, includeFields, candidateQueries)
    };
  }

  async generateSpecializedIndexes(analysis) {
    console.log('Generating specialized index recommendations...');

    const specializedIndexes = [];

    // Text search indexes for string fields with pattern matching
    const textFields = this.identifyTextSearchFields(analysis);
    textFields.forEach(textField => {
      specializedIndexes.push({
        type: 'text',
        name: `idx_text_${textField.field}`,
        specification: { [textField.field]: 'text' },
        options: {
          background: true,
          default_language: 'english',
          weights: { [textField.field]: textField.weight }
        },
        reasoning: `Text search index for ${textField.field} pattern matching`,
        applicableQueries: textField.queries,
        priority: textField.priority
      });
    });

    // Geospatial indexes for location data
    const geoFields = this.identifyGeospatialFields(analysis);
    geoFields.forEach(geoField => {
      specializedIndexes.push({
        type: 'geospatial',
        name: `idx_geo_${geoField.field}`,
        specification: { [geoField.field]: '2dsphere' },
        options: {
          background: true,
          '2dsphereIndexVersion': 3
        },
        reasoning: `Geospatial index for ${geoField.field} location queries`,
        applicableQueries: geoField.queries,
        priority: geoField.priority
      });
    });

    // TTL indexes for time-based data expiration
    const ttlFields = this.identifyTTLFields(analysis);
    ttlFields.forEach(ttlField => {
      specializedIndexes.push({
        type: 'ttl',
        name: `idx_ttl_${ttlField.field}`,
        specification: { [ttlField.field]: 1 },
        options: {
          background: true,
          expireAfterSeconds: ttlField.expireAfterSeconds
        },
        reasoning: `TTL index for automatic ${ttlField.field} data expiration`,
        expirationPeriod: ttlField.expirationPeriod,
        priority: ttlField.priority
      });
    });

    // Sparse indexes for fields with many null values
    const sparseFields = this.identifySparseFields(analysis);
    sparseFields.forEach(sparseField => {
      specializedIndexes.push({
        type: 'sparse',
        name: `idx_sparse_${sparseField.field}`,
        specification: { [sparseField.field]: 1 },
        options: {
          background: true,
          sparse: true
        },
        reasoning: `Sparse index for ${sparseField.field} excluding null values`,
        nullPercentage: sparseField.nullPercentage,
        priority: sparseField.priority
      });
    });

    return specializedIndexes
      .sort((a, b) => b.priority - a.priority)
      .slice(0, Math.floor(this.performanceTargets.maxIndexesPerCollection / 2));
  }

  async executeIndexingPlan(collection, plan) {
    console.log(`Executing indexing plan for ${collection.collectionName}...`);

    const results = {
      successful: [],
      failed: [],
      skipped: [],
      totalTime: 0
    };

    const startTime = Date.now();

    for (const index of plan.recommendedIndexes) {
      try {
        console.log(`Creating index: ${index.name}`);

        // Check if index already exists
        const existingIndexes = await collection.listIndexes().toArray();
        const indexExists = existingIndexes.some(existing => existing.name === index.name);

        if (indexExists) {
          console.log(`Index ${index.name} already exists, skipping...`);
          results.skipped.push({
            name: index.name,
            reason: 'Index already exists'
          });
          continue;
        }

        // Create the index
        const indexStartTime = Date.now();
        await collection.createIndex(index.specification, {
          name: index.name,
          ...index.options
        });
        const indexCreationTime = Date.now() - indexStartTime;

        results.successful.push({
          name: index.name,
          type: index.type,
          specification: index.specification,
          creationTime: indexCreationTime,
          estimatedImpact: index.estimatedImpact
        });

        console.log(`Index ${index.name} created successfully in ${indexCreationTime}ms`);

      } catch (error) {
        console.error(`Failed to create index ${index.name}:`, error.message);
        results.failed.push({
          name: index.name,
          type: index.type,
          error: error.message,
          specification: index.specification
        });
      }
    }

    results.totalTime = Date.now() - startTime;

    console.log(`Index creation completed in ${results.totalTime}ms`);
    console.log(`Successful: ${results.successful.length}, Failed: ${results.failed.length}, Skipped: ${results.skipped.length}`);

    return results;
  }

  async validateIndexPerformance(collection, plan, queryPatterns) {
    console.log('Validating index performance with test queries...');

    const validation = {
      queries: [],
      summary: {
        totalQueries: queryPatterns.length,
        improvedQueries: 0,
        avgImprovementPct: 0,
        significantImprovements: 0
      }
    };

    for (const pattern of queryPatterns.slice(0, 20)) { // Test top 20 patterns
      try {
        // Execute query with explain to get performance metrics
        const collection_handle = this.collections[collection.collectionName] || collection;

        let cursor;
        if (pattern.aggregation) {
          cursor = collection_handle.aggregate(pattern.aggregation);
        } else {
          cursor = collection_handle.find(pattern.filter || {});
          if (pattern.sort) cursor.sort(pattern.sort);
          if (pattern.limit) cursor.limit(pattern.limit);
          if (pattern.projection) cursor.project(pattern.projection);
        }

        const explainResult = await cursor.explain('executionStats');

        const queryValidation = {
          pattern: pattern.name || 'Unnamed query',
          executionTimeMs: explainResult.executionStats?.executionTimeMillis || 0,
          totalDocsExamined: explainResult.executionStats?.totalDocsExamined || 0,
          totalDocsReturned: explainResult.executionStats?.nReturned || 0,
          indexesUsed: this.extractIndexNames(explainResult),
          efficiency: this.calculateQueryEfficiency(explainResult),
          grade: this.assignPerformanceGrade(explainResult),
          improvement: this.calculateImprovement(pattern, explainResult)
        };

        validation.queries.push(queryValidation);

        if (queryValidation.improvement > 0) {
          validation.summary.improvedQueries++;
          validation.summary.avgImprovementPct += queryValidation.improvement;
        }

        if (queryValidation.improvement > 50) {
          validation.summary.significantImprovements++;
        }

      } catch (error) {
        console.warn(`Query validation failed for pattern: ${pattern.name}`, error.message);
        validation.queries.push({
          pattern: pattern.name || 'Unnamed query',
          error: error.message,
          success: false
        });
      }
    }

    if (validation.summary.improvedQueries > 0) {
      validation.summary.avgImprovementPct /= validation.summary.improvedQueries;
    }

    console.log(`Performance validation completed: ${validation.summary.improvedQueries}/${validation.summary.totalQueries} queries improved`);
    console.log(`Average improvement: ${validation.summary.avgImprovementPct.toFixed(1)}%`);
    console.log(`Significant improvements: ${validation.summary.significantImprovements}`);

    return validation;
  }

  // Helper methods for advanced index analysis and optimization

  buildIndexSpecification(fields) {
    const spec = {};
    fields.forEach(field => {
      spec[field.field] = field.direction;
    });
    return spec;
  }

  generateIndexOptions(fields, analysis) {
    return {
      background: true,
      ...(this.shouldUsePartialFilter(fields, analysis) && {
        partialFilterExpression: this.buildOptimalPartialFilter(fields, analysis)
      })
    };
  }

  isFieldUsedInSort(field, analysis) {
    for (const [sortPattern] of analysis.sortPatterns) {
      if (sortPattern.includes(`${field}:`)) {
        return true;
      }
    }
    return false;
  }

  getSortDirection(field, analysis) {
    for (const [sortPattern] of analysis.sortPatterns) {
      const fieldPattern = sortPattern.split(',').find(pattern => pattern.startsWith(`${field}:`));
      if (fieldPattern) {
        return parseInt(fieldPattern.split(':')[1]) || 1;
      }
    }
    return 1;
  }

  calculateCompoundSelectivity(fields) {
    // Estimate compound selectivity using field independence assumption
    return fields.reduce((selectivity, field) => {
      return selectivity * (field.selectivity || 0.1);
    }, 1);
  }

  validateIndexUtility(index, analysis) {
    // Validate that index provides meaningful benefit
    const estimatedSelectivity = this.calculateCompoundSelectivity(index.fields);
    const supportedQueries = this.identifyMatchingQueries(index, analysis);

    return estimatedSelectivity < 0.5 && supportedQueries.length > 0;
  }

  identifyMatchingQueries(index, analysis) {
    // Simplified query matching logic
    const matchingQueries = [];
    const indexFields = new Set(index.fields.map(f => f.field));

    // Check field combinations that would benefit from this index
    for (const [fieldCombination, frequency] of analysis.fieldCombinations) {
      const queryFields = new Set(fieldCombination.split(','));
      const overlap = [...indexFields].filter(field => queryFields.has(field));

      if (overlap.length >= 2) { // At least 2 fields overlap
        matchingQueries.push({
          fields: fieldCombination,
          frequency: frequency,
          coverage: overlap.length / indexFields.size
        });
      }
    }

    return matchingQueries;
  }

  calculateIndexPriority(index, frequency, analysis) {
    const baseScore = frequency * 10;
    const selectivityBonus = (1 - index.estimatedSelectivity) * 50;
    const fieldCountPenalty = index.fields.length * 5;

    return Math.max(0, baseScore + selectivityBonus - fieldCountPenalty);
  }

  calculatePerformanceImprovement(plan) {
    // Simplified improvement estimation
    const baseImprovement = plan.recommendedIndexes.length * 15; // 15% per index
    const compoundBonus = plan.recommendedIndexes.filter(idx => idx.type === 'compound').length * 25;
    const partialBonus = plan.recommendedIndexes.filter(idx => idx.type === 'partial').length * 35;

    return Math.min(90, baseImprovement + compoundBonus + partialBonus);
  }

  extractIndexNames(explainResult) {
    const indexes = new Set();

    const extractFromStage = (stage) => {
      if (stage.indexName) {
        indexes.add(stage.indexName);
      }
      if (stage.inputStage) {
        extractFromStage(stage.inputStage);
      }
      if (stage.inputStages) {
        stage.inputStages.forEach(extractFromStage);
      }
    };

    if (explainResult.executionStats?.executionStages) {
      extractFromStage(explainResult.executionStats.executionStages);
    }

    return Array.from(indexes);
  }

  calculateQueryEfficiency(explainResult) {
    const stats = explainResult.executionStats;
    if (!stats) return 0;

    const examined = stats.totalDocsExamined || 0;
    const returned = stats.nReturned || 0;

    return examined > 0 ? returned / examined : 1;
  }

  assignPerformanceGrade(explainResult) {
    const efficiency = this.calculateQueryEfficiency(explainResult);
    const executionTime = explainResult.executionStats?.executionTimeMillis || 0;
    const hasIndexScan = this.extractIndexNames(explainResult).length > 0;

    let score = 0;

    // Efficiency scoring
    if (efficiency >= 0.8) score += 40;
    else if (efficiency >= 0.5) score += 30;
    else if (efficiency >= 0.2) score += 20;
    else if (efficiency >= 0.1) score += 10;

    // Execution time scoring
    if (executionTime <= 50) score += 35;
    else if (executionTime <= 100) score += 25;
    else if (executionTime <= 250) score += 15;
    else if (executionTime <= 500) score += 5;

    // Index usage scoring
    if (hasIndexScan) score += 25;

    if (score >= 85) return 'A';
    else if (score >= 70) return 'B';
    else if (score >= 50) return 'C';
    else if (score >= 30) return 'D';
    else return 'F';
  }

  calculateImprovement(pattern, explainResult) {
    // Simplified improvement calculation
    const efficiency = this.calculateQueryEfficiency(explainResult);
    const executionTime = explainResult.executionStats?.executionTimeMillis || 0;
    const hasIndexScan = this.extractIndexNames(explainResult).length > 0;

    let improvementScore = 0;

    if (hasIndexScan) improvementScore += 30;
    if (efficiency > 0.5) improvementScore += 40;
    if (executionTime < 100) improvementScore += 30;

    return Math.min(100, improvementScore);
  }

  // Additional helper methods for specialized index types

  identifyTextSearchFields(analysis) {
    const textFields = [];

    analysis.filterTypes.forEach((types, field) => {
      if (types.has('pattern_match') && 
          (field.includes('name') || field.includes('title') || field.includes('description'))) {
        textFields.push({
          field: field,
          weight: analysis.fieldUsage.get(field) || 1,
          queries: [`Text search on ${field}`],
          priority: (analysis.fieldUsage.get(field) || 0) * 10
        });
      }
    });

    return textFields;
  }

  identifyGeospatialFields(analysis) {
    const geoFields = [];

    analysis.fieldUsage.forEach((usage, field) => {
      if (field.includes('location') || field.includes('coordinates') || 
          field.includes('lat') || field.includes('lng') || field.includes('geo')) {
        geoFields.push({
          field: field,
          queries: [`Geospatial queries on ${field}`],
          priority: usage * 15
        });
      }
    });

    return geoFields;
  }

  identifyTTLFields(analysis) {
    const ttlFields = [];

    analysis.fieldUsage.forEach((usage, field) => {
      if (field.includes('expires') || field.includes('expire') || 
          field === 'createdAt' || field === 'updatedAt') {
        ttlFields.push({
          field: field,
          expireAfterSeconds: this.getExpireAfterSeconds(field),
          expirationPeriod: this.getExpirationPeriod(field),
          priority: usage * 5
        });
      }
    });

    return ttlFields;
  }

  identifySparseFields(analysis) {
    const sparseFields = [];

    // Fields that are likely to have many null values
    const potentialSparseFields = ['phone', 'middle_name', 'company', 'notes', 'optional_field'];

    analysis.fieldUsage.forEach((usage, field) => {
      if (potentialSparseFields.some(sparse => field.includes(sparse))) {
        sparseFields.push({
          field: field,
          nullPercentage: 0.6, // Estimated
          priority: usage * 8
        });
      }
    });

    return sparseFields;
  }

  getExpireAfterSeconds(field) {
    const expirationMap = {
      'session': 86400,        // 1 day
      'temp': 3600,           // 1 hour  
      'cache': 1800,          // 30 minutes
      'token': 3600,          // 1 hour
      'verification': 86400,   // 1 day
      'expires': 0            // Use field value
    };

    for (const [key, seconds] of Object.entries(expirationMap)) {
      if (field.includes(key)) {
        return seconds;
      }
    }

    return 86400; // Default 1 day
  }

  getExpirationPeriod(field) {
    const expireAfter = this.getExpireAfterSeconds(field);
    if (expireAfter >= 86400) return `${Math.floor(expireAfter / 86400)} days`;
    if (expireAfter >= 3600) return `${Math.floor(expireAfter / 3600)} hours`;
    return `${Math.floor(expireAfter / 60)} minutes`;
  }

  async estimateFieldSelectivity(analysis) {
    // Simplified selectivity estimation
    // In production, this would use actual data sampling

    analysis.fieldUsage.forEach((usage, field) => {
      let estimatedSelectivity = 0.5; // Default

      // Status/enum fields typically have low cardinality
      if (field.includes('status') || field.includes('type') || field.includes('category')) {
        estimatedSelectivity = 0.1;
      }
      // ID fields have high cardinality
      else if (field.includes('id') || field.includes('_id')) {
        estimatedSelectivity = 0.9;
      }
      // Email fields have high cardinality
      else if (field.includes('email')) {
        estimatedSelectivity = 0.8;
      }
      // Date fields vary based on range
      else if (field.includes('date') || field.includes('time')) {
        estimatedSelectivity = 0.3;
      }

      analysis.selectivityEstimates.set(field, estimatedSelectivity);
    });
  }

  identifyOptimalFieldCombinations(analysis) {
    const combinations = [];

    // Sort combinations by frequency and expected performance impact
    const sortedCombinations = Array.from(analysis.fieldCombinations.entries())
      .sort(([, a], [, b]) => b - a);

    sortedCombinations.forEach(([combination, frequency]) => {
      const fields = combination.split(',');
      const totalSelectivity = fields.reduce((product, field) => {
        return product * (analysis.selectivityEstimates.get(field) || 0.5);
      }, 1);

      combinations.push({
        fields: fields,
        frequency: frequency,
        selectivity: totalSelectivity,
        score: frequency * (1 - totalSelectivity) * 100,
        reasoning: `Combination of ${fields.length} fields with ${frequency} usage frequency`
      });
    });

    return combinations
      .sort((a, b) => b.score - a.score)
      .slice(0, 15);
  }

  generateIndexingRecommendations(analysis, optimalCombinations) {
    return {
      topFieldCombinations: optimalCombinations.slice(0, 5),
      highUsageFields: Array.from(analysis.fieldUsage.entries())
        .sort(([, a], [, b]) => b - a)
        .slice(0, 10)
        .map(([field, usage]) => ({ field, usage })),
      selectiveFields: Array.from(analysis.selectivityEstimates.entries())
        .filter(([, selectivity]) => selectivity < 0.2)
        .sort(([, a], [, b]) => a - b)
        .map(([field, selectivity]) => ({ field, selectivity })),
      commonSortPatterns: Array.from(analysis.sortPatterns.entries())
        .sort(([, a], [, b]) => b - a)
        .slice(0, 5)
        .map(([pattern, frequency]) => ({ pattern, frequency }))
    };
  }
}

// Benefits of MongoDB Advanced Indexing Strategies:
// - Comprehensive compound index design using ESR (Equality, Sort, Range) optimization patterns
// - Intelligent partial indexing for selective filtering and reduced storage overhead
// - Sophisticated covering index generation for complete query optimization
// - Specialized index support for text search, geospatial, TTL, and sparse data patterns
// - Automated index performance validation and impact measurement
// - Production-ready index creation with background processing and error handling
// - Advanced query pattern analysis and field combination optimization
// - Integration with MongoDB's native indexing capabilities and query optimizer
// - Comprehensive performance monitoring and index effectiveness tracking
// - SQL-compatible index management through QueryLeaf integration

module.exports = {
  MongoIndexOptimizer
};

Understanding MongoDB Compound Index Architecture

Advanced Index Design Patterns and Performance Optimization

Implement sophisticated compound indexing strategies for production-scale applications:

// Production-ready compound index management and optimization patterns
class ProductionIndexManager extends MongoIndexOptimizer {
  constructor(db) {
    super(db);

    this.productionConfig = {
      maxConcurrentIndexBuilds: 2,
      indexMaintenanceWindows: ['02:00-04:00'],
      performanceMonitoringInterval: 300000, // 5 minutes
      autoOptimizationEnabled: true,
      indexUsageTrackingPeriod: 86400000 // 24 hours
    };

    this.indexMetrics = new Map();
    this.optimizationQueue = [];
  }

  async implementProductionIndexingWorkflow(collections) {
    console.log('Implementing production-grade indexing workflow...');

    const workflow = {
      phase1_analysis: await this.performComprehensiveIndexAnalysis(collections),
      phase2_planning: await this.generateProductionIndexPlan(collections),
      phase3_execution: await this.executeProductionIndexPlan(collections),
      phase4_monitoring: await this.setupIndexPerformanceMonitoring(collections),
      phase5_optimization: await this.implementContinuousOptimization(collections)
    };

    return {
      workflow: workflow,
      summary: this.generateWorkflowSummary(workflow),
      monitoring: await this.setupProductionMonitoring(collections),
      maintenance: await this.scheduleIndexMaintenance(collections)
    };
  }

  async performComprehensiveIndexAnalysis(collections) {
    console.log('Performing comprehensive production index analysis...');

    const analysis = {
      collections: [],
      globalPatterns: new Map(),
      crossCollectionOptimizations: [],
      resourceImpact: {},
      riskAssessment: {}
    };

    for (const collectionName of collections) {
      const collection = this.collections[collectionName];

      // Analyze current index usage
      const indexStats = await this.analyzeCurrentIndexUsage(collection);

      // Sample query patterns from profiler
      const queryPatterns = await this.extractQueryPatternsFromProfiler(collection);

      // Analyze data distribution and selectivity
      const dataDistribution = await this.analyzeDataDistribution(collection);

      // Resource utilization analysis
      const resourceUsage = await this.analyzeIndexResourceUsage(collection);

      analysis.collections.push({
        name: collectionName,
        indexStats: indexStats,
        queryPatterns: queryPatterns,
        dataDistribution: dataDistribution,
        resourceUsage: resourceUsage,
        recommendations: await this.generateCollectionSpecificRecommendations(collection, queryPatterns, dataDistribution)
      });
    }

    // Identify global optimization opportunities
    analysis.crossCollectionOptimizations = await this.identifyCrossCollectionOptimizations(analysis.collections);

    // Assess resource impact and risks
    analysis.resourceImpact = this.assessResourceImpact(analysis.collections);
    analysis.riskAssessment = this.performIndexingRiskAssessment(analysis.collections);

    return analysis;
  }

  async analyzeCurrentIndexUsage(collection) {
    console.log(`Analyzing current index usage for ${collection.collectionName}...`);

    try {
      // Get index statistics
      const indexStats = await collection.aggregate([
        { $indexStats: {} }
      ]).toArray();

      // Get collection statistics
      const collStats = await this.db.command({ collStats: collection.collectionName });

      const analysis = {
        indexes: [],
        totalIndexSize: 0,
        unusedIndexes: [],
        underutilizedIndexes: [],
        highImpactIndexes: [],
        recommendations: []
      };

      indexStats.forEach(indexStat => {
        const indexAnalysis = {
          name: indexStat.name,
          key: indexStat.key,
          accessCount: indexStat.accesses?.ops || 0,
          accessSinceLastRestart: indexStat.accesses?.since || new Date(),
          // $indexStats does not report index size; read it from collStats.indexSizes
          sizeBytes: collStats.indexSizes?.[indexStat.name] || 0,

          // Calculate utilization metrics
          utilizationScore: this.calculateIndexUtilizationScore(indexStat),
          efficiency: this.calculateIndexEfficiency(indexStat, collStats),

          // Categorize index usage
          category: this.categorizeIndexUsage(indexStat),

          // Performance impact assessment
          impactScore: this.calculateIndexImpactScore(indexStat, collStats)
        };

        analysis.indexes.push(indexAnalysis);
        analysis.totalIndexSize += indexAnalysis.sizeBytes;

        // Categorize indexes based on usage patterns
        if (indexAnalysis.category === 'unused') {
          analysis.unusedIndexes.push(indexAnalysis);
        } else if (indexAnalysis.category === 'underutilized') {
          analysis.underutilizedIndexes.push(indexAnalysis);
        } else if (indexAnalysis.impactScore > 80) {
          analysis.highImpactIndexes.push(indexAnalysis);
        }
      });

      // Generate optimization recommendations
      analysis.recommendations = this.generateIndexOptimizationRecommendations(analysis);

      return analysis;

    } catch (error) {
      console.warn(`Failed to analyze index usage for ${collection.collectionName}:`, error.message);
      return { error: error.message };
    }
  }

  async extractQueryPatternsFromProfiler(collection) {
    console.log(`Extracting query patterns from profiler for ${collection.collectionName}...`);

    try {
      // Query the profiler collection for recent operations
      const profileData = await this.db.collection('system.profile').aggregate([
        {
          $match: {
            ns: `${this.db.databaseName}.${collection.collectionName}`,
            ts: { $gte: new Date(Date.now() - this.productionConfig.indexUsageTrackingPeriod) },
            'command.find': { $exists: true }
          }
        },
        {
          $group: {
            _id: {
              filter: '$command.filter',
              sort: '$command.sort',
              projection: '$command.projection'
            },
            count: { $sum: 1 },
            avgExecutionTime: { $avg: '$millis' },
            totalDocsExamined: { $sum: '$docsExamined' },
            totalDocsReturned: { $sum: '$nreturned' },
            indexesUsed: { $addToSet: '$planSummary' }
          }
        },
        {
          $sort: { count: -1 }
        },
        {
          $limit: 100
        }
      ]).toArray();

      const patterns = profileData.map(pattern => ({
        filter: pattern._id.filter || {},
        sort: pattern._id.sort || {},
        projection: pattern._id.projection || {},
        frequency: pattern.count,
        avgExecutionTime: pattern.avgExecutionTime,
        efficiency: pattern.totalDocsReturned / Math.max(pattern.totalDocsExamined, 1),
        indexesUsed: pattern.indexesUsed,
        priority: this.calculateQueryPatternPriority(pattern)
      }));

      return patterns.sort((a, b) => b.priority - a.priority);

    } catch (error) {
      console.warn(`Failed to extract query patterns for ${collection.collectionName}:`, error.message);
      return [];
    }
  }

  async implementAdvancedIndexMonitoring(collections) {
    console.log('Setting up advanced index performance monitoring...');

    const monitoringConfig = {
      collections: collections,
      metrics: {
        indexUtilization: true,
        queryPerformance: true,
        resourceConsumption: true,
        growthTrends: true
      },
      alerts: {
        unusedIndexes: { threshold: 0.01, period: '7d' },
        slowQueries: { threshold: 1000, period: '1h' },
        highResourceUsage: { threshold: 0.8, period: '15m' }
      },
      reporting: {
        frequency: 'daily',
        recipients: ['dba-team@company.com']
      }
    };

    // Create monitoring aggregation pipelines
    const monitoringPipelines = await this.createMonitoringPipelines(collections);

    // Setup automated alerts
    const alertSystem = await this.setupIndexAlertSystem(monitoringConfig);

    // Initialize performance tracking
    const performanceTracker = await this.initializePerformanceTracking(collections);

    return {
      config: monitoringConfig,
      pipelines: monitoringPipelines,
      alerts: alertSystem,
      tracking: performanceTracker,
      dashboard: await this.createIndexMonitoringDashboard(collections)
    };
  }

  calculateIndexUtilizationScore(indexStat) {
    const accessCount = indexStat.accesses?.ops || 0;
    const timeSinceLastRestart = Date.now() - (indexStat.accesses?.since?.getTime() || Date.now());
    const hoursRunning = timeSinceLastRestart / (1000 * 60 * 60);

    // Calculate accesses per hour
    const accessesPerHour = hoursRunning > 0 ? accessCount / hoursRunning : 0;

    // Score based on usage frequency
    if (accessesPerHour > 100) return 100;
    else if (accessesPerHour > 10) return 80;
    else if (accessesPerHour > 1) return 60;
    else if (accessesPerHour > 0.1) return 40;
    else if (accessesPerHour > 0) return 20;
    else return 0;
  }

  calculateIndexEfficiency(indexStat, collStats) {
    const indexSize = indexStat.size || 0;
    const accessCount = indexStat.accesses?.ops || 0;
    const totalCollectionSize = collStats.size || 1;

    // Efficiency based on size-to-usage ratio
    const sizeRatio = indexSize / totalCollectionSize;
    const usageEfficiency = accessCount > 0 ? Math.min(100, accessCount / sizeRatio) : 0;

    return Math.round(usageEfficiency);
  }

  categorizeIndexUsage(indexStat) {
    const utilizationScore = this.calculateIndexUtilizationScore(indexStat);

    if (utilizationScore === 0) return 'unused';
    else if (utilizationScore < 20) return 'underutilized';
    else if (utilizationScore < 60) return 'moderate';
    else if (utilizationScore < 90) return 'well_used';
    else return 'critical';
  }

  calculateIndexImpactScore(indexStat, collStats) {
    const utilizationScore = this.calculateIndexUtilizationScore(indexStat);
    const efficiency = this.calculateIndexEfficiency(indexStat, collStats);
    const sizeImpact = (indexStat.size || 0) / (collStats.size || 1) * 100;

    // Combined impact score
    return Math.round((utilizationScore * 0.5) + (efficiency * 0.3) + (sizeImpact * 0.2));
  }

  calculateQueryPatternPriority(pattern) {
    const frequencyScore = Math.min(100, pattern.count * 2);
    const performanceScore = pattern.avgExecutionTime > 100 ? 50 : 
                           pattern.avgExecutionTime > 50 ? 30 : 10;
    const efficiencyScore = pattern.efficiency > 0.8 ? 0 : 
                          pattern.efficiency > 0.5 ? 20 : 40;

    return frequencyScore + performanceScore + efficiencyScore;
  }

  generateIndexOptimizationRecommendations(analysis) {
    const recommendations = [];

    // Unused index recommendations
    analysis.unusedIndexes.forEach(index => {
      if (index.name !== '_id_') { // Never recommend removing _id_ index
        recommendations.push({
          type: 'DROP_INDEX',
          priority: 'LOW',
          index: index.name,
          reason: `Index has ${index.accessCount} accesses since last restart`,
          estimatedSavings: `${(index.sizeBytes / 1024 / 1024).toFixed(2)}MB storage`,
          risk: 'Low - unused index can be safely removed'
        });
      }
    });

    // Underutilized index recommendations
    analysis.underutilizedIndexes.forEach(index => {
      recommendations.push({
        type: 'REVIEW_INDEX',
        priority: 'MEDIUM',
        index: index.name,
        reason: `Low utilization score: ${index.utilizationScore}`,
        suggestion: 'Review query patterns to determine if index can be optimized or removed',
        risk: 'Medium - verify index necessity before removal'
      });
    });

    // High impact index recommendations
    analysis.highImpactIndexes.forEach(index => {
      recommendations.push({
        type: 'OPTIMIZE_INDEX',
        priority: 'HIGH',
        index: index.name,
        reason: `High impact index with score: ${index.impactScore}`,
        suggestion: 'Consider optimizing or creating covering index variants',
        risk: 'High - critical for query performance'
      });
    });

    return recommendations.sort((a, b) => {
      const priorityOrder = { 'HIGH': 3, 'MEDIUM': 2, 'LOW': 1 };
      return priorityOrder[b.priority] - priorityOrder[a.priority];
    });
  }
}

SQL-Style Index Management with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB index management and optimization:

-- QueryLeaf advanced indexing with SQL-familiar syntax

-- Create comprehensive compound indexes using ESR pattern optimization
CREATE INDEX idx_users_esr_optimized ON users (
  -- Equality fields first (highest selectivity)
  status,           -- Equality filter: active, premium, trial
  subscription_tier, -- Equality filter: basic, premium, enterprise

  -- Sort fields second (maintain sort order)
  created_at DESC,  -- Sort field for chronological ordering
  last_login_at DESC, -- Sort field for activity-based ordering

  -- Range fields last (lowest selectivity impact)  
  total_spent,      -- Range filter for value-based queries
  account_score     -- Range filter for scoring queries
)
WITH INDEX_OPTIONS (
  background = true,
  name = 'idx_users_comprehensive_esr',

  -- Partial filter for active users only (reduces index size by ~70%)
  partial_filter = {
    status: { $in: ['active', 'premium', 'trial'] },
    subscription_tier: { $ne: null },
    last_login_at: { $gte: DATE('2024-01-01') }
  },

  -- Optimization hints
  optimization_level = 'aggressive',
  estimated_selectivity = 0.15,
  expected_query_patterns = ['user_dashboard', 'admin_user_list', 'billing_reports']
);

-- Advanced compound index with covering capability
CREATE COVERING INDEX idx_orders_comprehensive ON orders (
  -- Key fields (used in WHERE and ORDER BY)
  user_id,          -- Join field for user lookups
  status,           -- Filter field: pending, completed, cancelled
  order_date DESC,  -- Sort field for chronological ordering

  -- Included fields (returned in SELECT without document lookup)  
  INCLUDE (
    total_amount,
    discount_amount,
    payment_method,
    shipping_address,
    product_categories,
    order_notes
  )
)
WITH INDEX_OPTIONS (
  background = true,
  name = 'idx_orders_user_status_covering',

  -- Partial filter for recent orders
  partial_filter = {
    order_date: { $gte: DATE_SUB(CURRENT_DATE, INTERVAL 2 YEAR) },
    status: { $in: ['pending', 'processing', 'completed', 'shipped'] }
  },

  covering_optimization = true,
  estimated_coverage = '85% of order queries',
  storage_overhead = 'moderate'
);

-- Specialized indexes for different query patterns
CREATE TEXT INDEX idx_products_search ON products (
  product_name,
  description,
  tags,
  category
)
WITH TEXT_OPTIONS (
  default_language = 'english',
  language_override = 'language_field',
  weights = {
    product_name: 10,
    description: 5,  
    tags: 8,
    category: 3
  },
  text_index_version = 3
);

-- Geospatial index for location-based queries
CREATE GEOSPATIAL INDEX idx_stores_location ON stores (
  location  -- GeoJSON Point field
)
WITH GEO_OPTIONS (
  index_version = '2dsphere_v3',
  coordinate_system = 'WGS84',
  sparse = true,
  background = true
);

-- TTL index for session management
CREATE TTL INDEX idx_sessions_expiry ON user_sessions (
  created_at
)
WITH TTL_OPTIONS (
  expire_after_seconds = 3600,  -- 1 hour
  background = true,
  sparse = true
);

-- Partial index for selective filtering (high-value customers only)
CREATE PARTIAL INDEX idx_users_premium ON users (
  email,
  last_login_at DESC,
  total_lifetime_value DESC
)
WHERE subscription_tier IN ('premium', 'enterprise') 
  AND total_lifetime_value > 1000
  AND status = 'active'
WITH INDEX_OPTIONS (
  background = true,
  estimated_size_reduction = '80%',
  target_queries = ['premium_customer_analysis', 'high_value_user_reports']
);

-- Multi-key index for array fields
CREATE MULTIKEY INDEX idx_orders_products ON orders (
  product_ids,      -- Array field
  order_date DESC,
  total_amount
)
WITH INDEX_OPTIONS (
  background = true,
  multikey_optimization = true,
  array_field_hints = ['product_ids']
);

-- Comprehensive index analysis and optimization query
WITH index_usage_analysis AS (
  SELECT 
    collection_name,
    index_name,
    index_key,
    index_size_mb,
    access_count,
    access_rate_per_hour,

    -- Index efficiency metrics
    ROUND((access_count::float / GREATEST(index_size_mb, 0.1))::numeric, 2) as efficiency_ratio,

    -- Usage categorization
    CASE 
      WHEN access_rate_per_hour > 100 THEN 'critical'
      WHEN access_rate_per_hour > 10 THEN 'high_usage'
      WHEN access_rate_per_hour > 1 THEN 'moderate_usage'
      WHEN access_rate_per_hour > 0.1 THEN 'low_usage'
      ELSE 'unused'
    END as usage_category,

    -- Performance impact assessment
    CASE
      WHEN access_rate_per_hour > 50 AND efficiency_ratio > 10 THEN 'high_impact'
      WHEN access_rate_per_hour > 10 AND efficiency_ratio > 5 THEN 'medium_impact'  
      WHEN access_count > 0 THEN 'low_impact'
      ELSE 'no_impact'
    END as performance_impact,

    -- Storage overhead analysis
    CASE
      WHEN index_size_mb > 1000 THEN 'very_large'
      WHEN index_size_mb > 100 THEN 'large'
      WHEN index_size_mb > 10 THEN 'medium'
      ELSE 'small'
    END as storage_overhead

  FROM index_statistics
  WHERE collection_name IN ('users', 'orders', 'products', 'sessions')
),

query_pattern_analysis AS (
  SELECT 
    collection_name,
    query_shape,
    query_frequency,
    avg_execution_time_ms,
    avg_docs_examined,
    avg_docs_returned,

    -- Query efficiency metrics
    avg_docs_returned::float / GREATEST(avg_docs_examined, 1) as query_efficiency,

    -- Performance classification
    CASE
      WHEN avg_execution_time_ms > 1000 THEN 'slow'
      WHEN avg_execution_time_ms > 100 THEN 'moderate'  
      ELSE 'fast'
    END as performance_category,

    -- Index usage effectiveness
    CASE
      WHEN index_hit_rate > 0.9 THEN 'excellent_index_usage'
      WHEN index_hit_rate > 0.7 THEN 'good_index_usage'
      WHEN index_hit_rate > 0.5 THEN 'fair_index_usage'
      ELSE 'poor_index_usage'
    END as index_effectiveness

  FROM query_performance_log
  WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND query_frequency >= 10  -- Filter low-frequency queries
),

index_optimization_recommendations AS (
  SELECT 
    iu.collection_name,
    iu.index_name,
    iu.usage_category,
    iu.performance_impact,
    iu.storage_overhead,
    iu.efficiency_ratio,

    -- Optimization recommendations based on usage patterns
    CASE 
      WHEN iu.usage_category = 'unused' AND iu.index_name != '_id_' THEN 
        'DROP - Index is unused and consuming storage'
      WHEN iu.usage_category = 'low_usage' AND iu.efficiency_ratio < 1 THEN
        'REVIEW - Low usage and poor efficiency, consider dropping'
      WHEN iu.performance_impact = 'high_impact' AND iu.storage_overhead = 'very_large' THEN
        'OPTIMIZE - Consider partial index or covering index alternative'  
      WHEN iu.usage_category = 'critical' AND qp.performance_category = 'slow' THEN
        'ENHANCE - Critical index supporting slow queries, needs optimization'
      WHEN iu.efficiency_ratio > 50 AND iu.performance_impact = 'high_impact' THEN
        'MAINTAIN - Well-performing index, continue monitoring'
      ELSE 'MONITOR - Acceptable performance, regular monitoring recommended'
    END as recommendation,

    -- Priority calculation
    CASE 
      WHEN iu.performance_impact = 'high_impact' AND qp.performance_category = 'slow' THEN 'CRITICAL'
      WHEN iu.usage_category = 'unused' AND iu.storage_overhead = 'very_large' THEN 'HIGH'
      WHEN iu.efficiency_ratio < 1 AND iu.storage_overhead IN ('large', 'very_large') THEN 'MEDIUM'
      ELSE 'LOW'
    END as priority,

    -- Estimated impact
    CASE
      WHEN iu.usage_category = 'unused' THEN 
        CONCAT('Storage savings: ', iu.index_size_mb, 'MB')
      WHEN iu.performance_impact = 'high_impact' THEN
        CONCAT('Query performance: ', ROUND(qp.avg_execution_time_ms * 0.3), 'ms reduction potential')
      ELSE 'Minimal impact expected'
    END as estimated_impact

  FROM index_usage_analysis iu
  LEFT JOIN query_pattern_analysis qp ON iu.collection_name = qp.collection_name
)

SELECT 
  collection_name,
  index_name,
  usage_category,
  performance_impact,
  recommendation,
  priority,
  estimated_impact,

  -- Action items
  CASE priority
    WHEN 'CRITICAL' THEN 'Immediate action required - review within 24 hours'
    WHEN 'HIGH' THEN 'Schedule optimization within 1 week'
    WHEN 'MEDIUM' THEN 'Include in next maintenance window'
    ELSE 'Monitor and review quarterly'
  END as action_timeline,

  -- Technical implementation guidance
  CASE 
    WHEN recommendation LIKE 'DROP%' THEN 
      CONCAT('Execute: DROP INDEX ', collection_name, '.', index_name)
    WHEN recommendation LIKE 'OPTIMIZE%' THEN
      'Analyze query patterns and create optimized compound index'
    WHEN recommendation LIKE 'ENHANCE%' THEN
      'Review index field order and consider covering index'
    ELSE 'Continue current monitoring procedures'
  END as implementation_guidance

FROM index_optimization_recommendations
WHERE priority IN ('CRITICAL', 'HIGH', 'MEDIUM')
ORDER BY 
  CASE priority WHEN 'CRITICAL' THEN 1 WHEN 'HIGH' THEN 2 WHEN 'MEDIUM' THEN 3 ELSE 4 END,
  collection_name,
  index_name;

-- Real-time index performance monitoring
CREATE MATERIALIZED VIEW index_performance_dashboard AS
WITH real_time_metrics AS (
  SELECT 
    collection_name,
    index_name,
    DATE_TRUNC('minute', access_timestamp) as minute_bucket,

    -- Real-time utilization metrics
    COUNT(*) as accesses_per_minute,
    AVG(query_execution_time_ms) as avg_query_time,
    SUM(docs_examined) as total_docs_examined,
    SUM(docs_returned) as total_docs_returned,

    -- Index efficiency in real-time
    SUM(docs_returned)::float / GREATEST(SUM(docs_examined), 1) as real_time_efficiency,

    -- Performance trends
    LAG(COUNT(*)) OVER (
      PARTITION BY collection_name, index_name 
      ORDER BY DATE_TRUNC('minute', access_timestamp)
    ) as prev_minute_accesses,

    LAG(AVG(query_execution_time_ms)) OVER (
      PARTITION BY collection_name, index_name
      ORDER BY DATE_TRUNC('minute', access_timestamp)  
    ) as prev_minute_avg_time

  FROM index_access_log
  WHERE access_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY collection_name, index_name, DATE_TRUNC('minute', access_timestamp)
),

performance_alerts AS (
  SELECT 
    collection_name,
    index_name,
    minute_bucket,
    accesses_per_minute,
    avg_query_time,
    real_time_efficiency,

    -- Performance change indicators
    CASE 
      WHEN prev_minute_accesses IS NOT NULL THEN
        ((accesses_per_minute - prev_minute_accesses)::float / prev_minute_accesses * 100)
      ELSE 0
    END as access_rate_change_pct,

    CASE
      WHEN prev_minute_avg_time IS NOT NULL THEN
        ((avg_query_time - prev_minute_avg_time)::float / prev_minute_avg_time * 100) 
      ELSE 0
    END as latency_change_pct,

    -- Alert conditions
    CASE
      WHEN avg_query_time > 1000 THEN 'HIGH_LATENCY_ALERT'
      WHEN real_time_efficiency < 0.1 THEN 'LOW_EFFICIENCY_ALERT'
      WHEN accesses_per_minute > 1000 THEN 'HIGH_LOAD_ALERT'
      WHEN prev_minute_accesses IS NOT NULL AND 
           accesses_per_minute > prev_minute_accesses * 5 THEN 'LOAD_SPIKE_ALERT'
      ELSE 'NORMAL'
    END as alert_status,

    -- Optimization suggestions
    CASE
      WHEN avg_query_time > 1000 AND real_time_efficiency < 0.2 THEN 
        'Consider index redesign or query optimization'
      WHEN accesses_per_minute > 500 AND real_time_efficiency > 0.8 THEN
        'High-performing index under load - monitor for scaling needs'
      WHEN real_time_efficiency < 0.1 THEN
        'Poor selectivity - review partial index opportunities'
      ELSE 'Performance within acceptable parameters'
    END as optimization_suggestion

  FROM real_time_metrics
  WHERE minute_bucket >= CURRENT_TIMESTAMP - INTERVAL '15 minutes'
)

SELECT 
  collection_name,
  index_name,
  ROUND(AVG(accesses_per_minute)::numeric, 1) as avg_accesses_per_minute,
  ROUND(AVG(avg_query_time)::numeric, 2) as avg_latency_ms,
  ROUND(AVG(real_time_efficiency)::numeric, 3) as avg_efficiency,
  ROUND(AVG(access_rate_change_pct)::numeric, 1) as avg_load_change_pct,
  ROUND(AVG(latency_change_pct)::numeric, 1) as avg_latency_change_pct,

  -- Alert summary
  COUNT(*) FILTER (WHERE alert_status != 'NORMAL') as alert_count,
  STRING_AGG(DISTINCT alert_status, ', ') FILTER (WHERE alert_status != 'NORMAL') as active_alerts,
  MODE() WITHIN GROUP (ORDER BY optimization_suggestion) as primary_recommendation,

  -- Performance status
  CASE 
    WHEN COUNT(*) FILTER (WHERE alert_status LIKE '%HIGH%') > 0 THEN 'ATTENTION_REQUIRED'
    WHEN AVG(real_time_efficiency) > 0.7 AND AVG(avg_query_time) < 100 THEN 'OPTIMAL'
    WHEN AVG(real_time_efficiency) > 0.5 AND AVG(avg_query_time) < 250 THEN 'GOOD'  
    ELSE 'NEEDS_OPTIMIZATION'
  END as overall_status

FROM performance_alerts
GROUP BY collection_name, index_name
ORDER BY 
  CASE overall_status 
    WHEN 'ATTENTION_REQUIRED' THEN 1 
    WHEN 'NEEDS_OPTIMIZATION' THEN 2
    WHEN 'GOOD' THEN 3
    WHEN 'OPTIMAL' THEN 4
  END,
  avg_accesses_per_minute DESC;

-- QueryLeaf provides comprehensive indexing capabilities:
-- 1. SQL-familiar syntax for complex MongoDB index creation and management
-- 2. Advanced compound index design with ESR pattern optimization
-- 3. Partial and covering index support for storage and performance optimization
-- 4. Specialized index types: text, geospatial, TTL, sparse, and multikey indexes
-- 5. Real-time index performance monitoring and alerting
-- 6. Automated optimization recommendations based on usage patterns
-- 7. Production-ready index management with background creation and maintenance
-- 8. Comprehensive index analysis and resource utilization tracking
-- 9. Cross-collection optimization opportunities identification  
-- 10. Integration with MongoDB's native indexing capabilities and query optimizer

Best Practices for Production Index Management

Index Design Strategy

Essential principles for effective MongoDB index design and management:

  1. ESR Pattern Application: Design compound indexes following Equality, Sort, Range field ordering for optimal performance (see the sketch after this list)
  2. Selective Filtering: Use partial indexes for selective data filtering to reduce storage overhead and improve performance
  3. Covering Index Design: Create covering indexes for frequently accessed query patterns to eliminate document retrieval
  4. Index Consolidation: Minimize index count by designing compound indexes that support multiple query patterns
  5. Performance Monitoring: Implement comprehensive index utilization monitoring and automated optimization
  6. Maintenance Planning: Schedule regular index maintenance and optimization during low-traffic periods
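As a concrete illustration of points 1-3, the following minimal sketch shows how an ESR-ordered compound index with a partial filter might be created through the native Node.js driver. The collection name, field names, and filter thresholds are illustrative assumptions, not values taken from a real schema:

// Minimal sketch: ESR-ordered compound index with a partial filter
// (collection, fields, and thresholds are illustrative assumptions)
const { MongoClient } = require('mongodb');

async function createEsrIndex(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const users = client.db('app').collection('users');

    await users.createIndex(
      {
        status: 1,              // Equality fields first
        subscription_tier: 1,
        created_at: -1,         // Sort field next
        total_spent: 1          // Range field last
      },
      {
        name: 'idx_users_esr_sketch',
        // Partial filter keeps the index small by indexing active users only
        partialFilterExpression: { status: { $in: ['active', 'premium', 'trial'] } }
      }
    );

    // A query shaped to match the index: equality filters, then sort, then range
    return users
      .find({ status: 'active', subscription_tier: 'premium', total_spent: { $gte: 100 } })
      .sort({ created_at: -1 })
      .limit(20)
      .toArray();
  } finally {
    await client.close();
  }
}

Note that queries falling outside the partial filter (for example, queries over inactive users) cannot use this index, so the filter should mirror the predicates the application actually issues.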

Production Optimization Workflow

Optimize MongoDB indexes systematically for production environments:

  1. Usage Analysis: Analyze actual index usage patterns using the database profiler and index statistics (see the sketch after this list)
  2. Query Pattern Recognition: Identify common query patterns and optimize indexes for primary use cases
  3. Performance Validation: Validate index performance improvements with comprehensive testing
  4. Resource Management: Balance query performance with storage overhead and maintenance costs
  5. Continuous Monitoring: Implement ongoing performance monitoring and automated alert systems
  6. Iterative Optimization: Regularly review and refine indexing strategies based on evolving query patterns
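The usage-analysis step above can be approximated with MongoDB's built-in $indexStats aggregation stage. The sketch below flags indexes that have seen little or no use since the server last restarted; the one-access-per-day threshold and the exclusion of the _id_ index are assumptions chosen for illustration:

// Minimal sketch: flag rarely used indexes from $indexStats output
// (threshold of one access per day is an illustrative assumption)
async function findLowUsageIndexes(db, collectionName) {
  const stats = await db.collection(collectionName)
    .aggregate([{ $indexStats: {} }])
    .toArray();

  return stats
    .map(stat => {
      const ops = stat.accesses?.ops || 0;
      const sinceMs = Date.now() - new Date(stat.accesses?.since || Date.now()).getTime();
      const opsPerDay = sinceMs > 0 ? ops / (sinceMs / 86400000) : 0;
      return { name: stat.name, ops, opsPerDay };
    })
    .filter(idx => idx.name !== '_id_' && idx.opsPerDay < 1) // never flag the _id_ index
    .sort((a, b) => a.opsPerDay - b.opsPerDay);
}

Anything this flags should still pass through the validation and monitoring steps above before being dropped, since $indexStats counters reset on restart and can miss periodic workloads such as month-end reporting.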

Conclusion

MongoDB's advanced indexing capabilities provide comprehensive tools for optimizing database performance through sophisticated compound indexes, partial filtering, covering indexes, and specialized index types. The flexible indexing architecture enables developers to design highly optimized indexes that support complex query patterns while minimizing storage overhead and maintenance costs.

Key MongoDB Advanced Indexing benefits include:

  • Comprehensive Index Types: Support for compound, partial, covering, text, geospatial, TTL, and sparse indexes
  • ESR Pattern Optimization: Systematic compound index design following proven optimization patterns
  • Performance Intelligence: Advanced index utilization analysis and automated optimization recommendations
  • Production-Ready Management: Sophisticated index creation, maintenance, and monitoring capabilities
  • Resource Optimization: Intelligent index design that balances performance with storage efficiency
  • Query Pattern Adaptation: Flexible indexing strategies that adapt to evolving application requirements

Whether you're optimizing existing applications, designing new database schemas, or implementing production indexing strategies, MongoDB's advanced indexing capabilities with QueryLeaf's familiar SQL interface provide the foundation for high-performance database operations.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB indexing strategies while providing SQL-familiar syntax for index creation, analysis, and optimization. Advanced indexing patterns, performance monitoring capabilities, and production-ready index management are seamlessly handled through familiar SQL constructs, making sophisticated database optimization both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's flexible indexing architecture with SQL-style index management makes it an ideal platform for applications requiring both high-performance queries and familiar database optimization patterns, ensuring your applications achieve optimal performance while remaining maintainable and scalable as they grow.

MongoDB Replica Sets and High Availability: Advanced Disaster Recovery and Fault Tolerance Strategies for Mission-Critical Applications

Mission-critical applications require database infrastructure that can withstand hardware failures, network outages, and data center disasters while maintaining continuous availability and data consistency. Traditional database replication approaches often introduce complexity, performance overhead, and operational challenges that become increasingly problematic as application scale and reliability requirements grow.

MongoDB's replica set architecture provides sophisticated high availability and disaster recovery capabilities that eliminate single points of failure while maintaining strong data consistency and automatic failover functionality. Unlike traditional master-slave replication systems with manual failover processes, MongoDB replica sets offer self-healing infrastructure with intelligent election algorithms, configurable read preferences, and comprehensive disaster recovery features that ensure business continuity even during catastrophic failures.
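Before examining the traditional approaches below, the following minimal sketch shows what replica-set-aware connectivity looks like from the application side; the host names, replica set name, and namespaces are illustrative assumptions:

// Minimal sketch: an application connecting to a replica set
// (hosts, replica set name, and namespaces are illustrative assumptions)
const { MongoClient } = require('mongodb');

const uri = 'mongodb://db1.example.net:27017,db2.example.net:27017,db3.example.net:27017' +
            '/?replicaSet=rs0&w=majority&retryWrites=true&readPreference=secondaryPreferred';

async function recordAndReadEvents() {
  const client = new MongoClient(uri);
  await client.connect();

  // Writes go to whichever member is currently primary; after a failover the
  // driver re-routes to the newly elected primary without application changes
  await client.db('app').collection('events').insertOne({ type: 'login', at: new Date() });

  // Reads follow the configured read preference (secondaryPreferred here)
  const recent = await client.db('app').collection('events')
    .find({}).sort({ at: -1 }).limit(10).toArray();

  await client.close();
  return recent;
}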

The Traditional Database Replication Challenge

Conventional database replication systems have significant limitations for high-availability requirements:

-- Traditional PostgreSQL streaming replication - manual failover and limited flexibility

-- Primary server configuration (postgresql.conf)
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 64
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'

-- Standby server configuration (recovery.conf)  
standby_mode = 'on'
primary_conninfo = 'host=primary-server port=5432 user=replicator'
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
trigger_file = '/tmp/postgresql.trigger.5432'

-- Manual failover process (complex and error-prone)
-- 1. Detect primary failure through monitoring
SELECT pg_is_in_recovery(); -- Check if server is in standby mode

-- 2. Promote standby to primary (manual intervention required)
-- Touch trigger file on standby server
-- $ touch /tmp/postgresql.trigger.5432

-- 3. Redirect application traffic (requires external load balancer configuration)
-- Update DNS/load balancer to point to new primary
-- Verify all applications can connect to new primary

-- 4. Reconfigure remaining servers (manual process)
-- Update primary_conninfo on other standby servers
-- Restart PostgreSQL services with new configuration

-- Complex query for checking replication lag
WITH replication_status AS (
  SELECT 
    client_addr,
    client_hostname,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag,
    sync_priority,
    sync_state,

    -- Calculate replication delay in bytes
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_delay_bytes,

    -- Check if standby is healthy
    CASE 
      WHEN state = 'streaming' AND pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) < 16777216 THEN 'healthy'
      WHEN state = 'streaming' AND pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) < 134217728 THEN 'lagging'
      WHEN state = 'streaming' THEN 'severely_lagging'
      ELSE 'disconnected'
    END as health_status,

    -- Estimate recovery time if primary fails
    CASE 
      WHEN replay_lag IS NOT NULL THEN 
        EXTRACT(EPOCH FROM replay_lag)::int
      ELSE 
        GREATEST(
          EXTRACT(EPOCH FROM flush_lag)::int,
          pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 16777216 * 10
        )
    END as estimated_recovery_seconds

  FROM pg_stat_replication
  WHERE state IS NOT NULL
),

connection_health AS (
  SELECT 
    datname,
    usename,
    client_addr,
    state,
    query,
    state_change,

    -- Connection duration
    EXTRACT(EPOCH FROM (now() - backend_start))::int as connection_age_seconds,

    -- Query duration  
    CASE 
      WHEN state = 'active' THEN EXTRACT(EPOCH FROM (now() - query_start))::int
      ELSE 0
    END as active_query_duration_seconds,

    -- Identify potentially problematic connections
    CASE
      WHEN state = 'idle in transaction' AND (now() - state_change) > interval '5 minutes' THEN 'long_idle_transaction'
      WHEN state = 'active' AND (now() - query_start) > interval '10 minutes' THEN 'long_running_query'
      WHEN backend_type = 'walsender' THEN 'replication_connection'
      ELSE 'normal'
    END as connection_type

  FROM pg_stat_activity
  WHERE backend_type IN ('client backend', 'walsender')
    AND datname IS NOT NULL
)

-- Comprehensive replication monitoring query
SELECT 
  rs.client_addr as standby_server,
  rs.client_hostname as standby_hostname,
  rs.state as replication_state,
  rs.health_status,

  -- Lag information
  COALESCE(EXTRACT(EPOCH FROM rs.replay_lag)::int, 0) as replay_lag_seconds,
  ROUND(rs.replay_delay_bytes / 1048576.0, 2) as replay_delay_mb,
  rs.estimated_recovery_seconds,

  -- Sync configuration
  rs.sync_priority,
  rs.sync_state,

  -- Connection health
  ch.connection_age_seconds,
  ch.active_query_duration_seconds,

  -- Health assessment
  CASE 
    WHEN rs.health_status = 'healthy' AND rs.sync_state = 'sync' THEN 'excellent'
    WHEN rs.health_status = 'healthy' AND rs.sync_state = 'async' THEN 'good'
    WHEN rs.health_status = 'lagging' THEN 'warning'
    WHEN rs.health_status = 'severely_lagging' THEN 'critical'
    ELSE 'unknown'
  END as overall_health,

  -- Failover readiness
  CASE
    WHEN rs.health_status = 'healthy' AND rs.estimated_recovery_seconds < 30 THEN 'ready'
    WHEN rs.health_status IN ('healthy', 'lagging') AND rs.estimated_recovery_seconds < 120 THEN 'acceptable'
    ELSE 'not_ready'
  END as failover_readiness,

  -- Recommendations
  CASE
    WHEN rs.health_status = 'disconnected' THEN 'Check network connectivity and standby server status'
    WHEN rs.health_status = 'severely_lagging' THEN 'Investigate standby performance and network bandwidth'
    WHEN rs.replay_delay_bytes > 134217728 THEN 'Consider increasing wal_keep_segments or using replication slots'
    WHEN rs.sync_state != 'sync' AND rs.sync_priority > 0 THEN 'Review synchronous_standby_names configuration'
    ELSE 'Replication operating normally'
  END as recommendation

FROM replication_status rs
LEFT JOIN connection_health ch ON rs.client_addr = ch.client_addr 
                                AND ch.connection_type = 'replication_connection'
ORDER BY rs.sync_priority DESC, rs.replay_delay_bytes ASC;

-- Problems with traditional PostgreSQL replication:
-- 1. Manual failover process requiring human intervention and expertise
-- 2. Complex configuration management across multiple servers
-- 3. Limited built-in monitoring and health checking capabilities
-- 4. Potential for data loss during failover if not configured properly
-- 5. Application-level connection management complexity
-- 6. No automatic discovery of new primary after failover
-- 7. Split-brain scenarios possible without proper fencing mechanisms
-- 8. Limited geographic distribution capabilities for disaster recovery
-- 9. Difficulty in adding/removing replica servers without downtime
-- 10. Complex backup and point-in-time recovery coordination across replicas

-- Additional monitoring complexity
-- Check for replication slots to prevent WAL accumulation
SELECT 
  slot_name,
  plugin,
  slot_type,
  datoid,
  database,
  temporary,
  active,
  active_pid,
  xmin,
  catalog_xmin,
  restart_lsn,
  confirmed_flush_lsn,

  -- Calculate slot lag
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as slot_lag_bytes,

  -- Check if slot is causing WAL retention
  CASE 
    WHEN active = false THEN 'inactive_slot'
    WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824 THEN 'excessive_lag'
    ELSE 'healthy'
  END as slot_status

FROM pg_replication_slots
ORDER BY slot_lag_bytes DESC;

-- MySQL replication (even more limited)
-- Master configuration
log-bin=mysql-bin
server-id=1
binlog-format=ROW
sync-binlog=1
innodb-flush-log-at-trx-commit=1

-- Slave configuration  
server-id=2
relay-log=mysql-relay
read-only=1

-- Basic replication status (limited information)
SHOW SLAVE STATUS\G

-- Manual failover process (basic and risky)
STOP SLAVE;
RESET SLAVE ALL;
-- Manually change master configuration

-- MySQL replication limitations:
-- - Even more manual failover process
-- - Limited monitoring and diagnostics
-- - Poor handling of network partitions
-- - Basic conflict resolution
-- - Limited geographic replication support
-- - Minimal built-in health checking
-- - Simple master-slave topology only

MongoDB provides comprehensive high availability through replica sets:

// MongoDB Replica Sets - automatic failover with advanced high availability features
const { MongoClient } = require('mongodb');

// Advanced MongoDB Replica Set Management and High Availability System
class MongoReplicaSetManager {
  constructor(connectionString) {
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;

    // High availability configuration
    this.replicaSetConfig = {
      members: [],
      settings: {
        chainingAllowed: true,
        heartbeatIntervalMillis: 2000,
        heartbeatTimeoutSecs: 10,
        electionTimeoutMillis: 10000,
        catchUpTimeoutMillis: 60000,
        getLastErrorModes: {},
        getLastErrorDefaults: { w: 1, wtimeout: 0 }
      }
    };

    this.healthMetrics = new Map();
    this.failoverHistory = [];
    this.performanceTargets = {
      maxReplicationLagSeconds: 10,
      maxElectionTimeSeconds: 30,
      minHealthyMembers: 2
    };
  }

  async initializeReplicaSet(members, options = {}) {
    console.log('Initializing MongoDB replica set with advanced high availability...');

    const {
      replicaSetName = 'rs0',
      priority = { primary: 1, secondary: 0.5, arbiter: 0 },
      tags = {},
      writeConcern = { w: 'majority', j: true },
      readPreference = 'primaryPreferred'
    } = options;

    try {
      // Connect to the primary candidate
      this.client = new MongoClient(this.connectionString, {
        useNewUrlParser: true,
        useUnifiedTopology: true,
        replicaSet: replicaSetName,
        readPreference: readPreference,
        writeConcern: writeConcern,
        maxPoolSize: 10,
        serverSelectionTimeoutMS: 5000,
        socketTimeoutMS: 45000,
        heartbeatFrequencyMS: 10000,
        retryWrites: true,
        retryReads: true
      });

      await this.client.connect();
      this.db = this.client.db('admin');

      // Build replica set configuration
      const replicaSetConfig = {
        _id: replicaSetName,
        version: 1,
        members: members.map((member, index) => ({
          _id: index,
          host: member.host,
          priority: member.priority || priority[member.type] || 1,
          votes: 1, // every member votes by default (MongoDB allows at most 7 voting members)
          arbiterOnly: member.type === 'arbiter',
          buildIndexes: member.type !== 'arbiter',
          hidden: member.hidden || false,
          slaveDelay: member.slaveDelay || 0,
          tags: { ...tags[member.type], region: member.region, datacenter: member.datacenter }
        })),
        settings: {
          chainingAllowed: true,
          heartbeatIntervalMillis: 2000,
          heartbeatTimeoutSecs: 10,
          electionTimeoutMillis: 10000,
          catchUpTimeoutMillis: 60000,

          // Advanced write concern configurations
          getLastErrorModes: {
            multiDataCenter: { datacenter: 2 },
            majority: { region: 2 }
          },
          getLastErrorDefaults: { 
            w: 'majority', 
            j: true,
            wtimeout: 10000 
          }
        }
      };

      // Initialize replica set
      const initResult = await this.db.runCommand({
        replSetInitiate: replicaSetConfig
      });

      if (initResult.ok === 1) {
        console.log('Replica set initialized successfully');

        // Wait for primary election
        await this.waitForPrimaryElection();

        // Perform initial health check
        const healthStatus = await this.performComprehensiveHealthCheck();

        // Setup monitoring
        await this.setupAdvancedMonitoring();

        console.log('Replica set ready for high availability operations');
        return {
          success: true,
          replicaSetName: replicaSetName,
          members: members,
          healthStatus: healthStatus
        };
      } else {
        throw new Error(`Replica set initialization failed: ${initResult.errmsg}`);
      }

    } catch (error) {
      console.error('Replica set initialization error:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async performComprehensiveHealthCheck() {
    console.log('Performing comprehensive replica set health assessment...');

    const healthReport = {
      timestamp: new Date(),
      replicaSetStatus: null,
      memberHealth: [],
      replicationLag: {},
      electionMetrics: {},
      networkConnectivity: {},
      performanceMetrics: {},
      alerts: [],
      recommendations: []
    };

    try {
      // Get replica set status
      const rsStatus = await this.db.runCommand({ replSetGetStatus: 1 });
      healthReport.replicaSetStatus = {
        name: rsStatus.set,
        primary: rsStatus.members.find(m => m.state === 1)?.name,
        memberCount: rsStatus.members.length,
        healthyMembers: rsStatus.members.filter(m => [1, 2, 7].includes(m.state)).length,
        state: rsStatus.myState
      };

      // Analyze each member
      for (const member of rsStatus.members) {
        const memberHealth = {
          name: member.name,
          state: member.state,
          stateStr: member.stateStr,
          health: member.health,
          uptime: member.uptime,
          lastHeartbeat: member.lastHeartbeat,
          lastHeartbeatRecv: member.lastHeartbeatRecv,
          pingMs: member.pingMs,
          syncSourceHost: member.syncingTo,

          // Calculate replication lag
          replicationLag: member.optimeDate && rsStatus.date ? 
            (rsStatus.date - member.optimeDate) / 1000 : null,

          // Member status assessment
          status: this.assessMemberStatus(member),

          // Performance metrics
          performanceMetrics: {
            heartbeatLatency: member.pingMs,
            connectionHealth: member.health === 1 ? 'healthy' : 'unhealthy',
            stateStability: this.assessStateStability(member)
          }
        };

        healthReport.memberHealth.push(memberHealth);

        // Track replication lag
        if (memberHealth.replicationLag !== null) {
          healthReport.replicationLag[member.name] = memberHealth.replicationLag;
        }
      }

      // Analyze election metrics
      healthReport.electionMetrics = await this.analyzeElectionMetrics(rsStatus);

      // Check network connectivity
      healthReport.networkConnectivity = await this.checkNetworkConnectivity(rsStatus.members);

      // Generate alerts based on thresholds
      healthReport.alerts = this.generateHealthAlerts(healthReport);

      // Generate recommendations
      healthReport.recommendations = this.generateHealthRecommendations(healthReport);

      console.log(`Health check completed: ${healthReport.memberHealth.length} members analyzed`);
      console.log(`Healthy members: ${healthReport.replicaSetStatus.healthyMembers}/${healthReport.replicaSetStatus.memberCount}`);
      console.log(`Alerts generated: ${healthReport.alerts.length}`);

      return healthReport;

    } catch (error) {
      console.error('Health check failed:', error);
      healthReport.error = error.message;
      return healthReport;
    }
  }

  assessMemberStatus(member) {
    const status = {
      overall: 'unknown',
      issues: [],
      strengths: []
    };

    // State-based assessment
    switch (member.state) {
      case 1: // PRIMARY
        status.overall = 'primary';
        status.strengths.push('Acting as primary, accepting writes');
        break;
      case 2: // SECONDARY
        status.overall = 'healthy';
        status.strengths.push('Healthy secondary, replicating data');
        if (member.optimeDate && Date.now() - member.optimeDate > 30000) {
          status.issues.push('Replication lag exceeds 30 seconds');
          status.overall = 'lagging';
        }
        break;
      case 3: // RECOVERING
        status.overall = 'recovering';
        status.issues.push('Member is in recovery state');
        break;
      case 5: // STARTUP2
        status.overall = 'starting';
        status.issues.push('Member is in startup phase');
        break;
      case 6: // UNKNOWN
        status.overall = 'unknown';
        status.issues.push('Member state is unknown');
        break;
      case 7: // ARBITER
        status.overall = 'arbiter';
        status.strengths.push('Functioning arbiter for elections');
        break;
      case 8: // DOWN
        status.overall = 'down';
        status.issues.push('Member is down or unreachable');
        break;
      case 9: // ROLLBACK
        status.overall = 'rollback';
        status.issues.push('Member is performing rollback');
        break;
      case 10: // REMOVED
        status.overall = 'removed';
        status.issues.push('Member has been removed from replica set');
        break;
      default:
        status.overall = 'unknown';
        status.issues.push(`Unexpected state: ${member.state}`);
    }

    // Health-based assessment
    if (member.health !== 1) {
      status.issues.push('Member health check failing');
      if (status.overall === 'healthy') {
        status.overall = 'unhealthy';
      }
    }

    // Network latency assessment
    if (member.pingMs && member.pingMs > 100) {
      status.issues.push(`High network latency: ${member.pingMs}ms`);
    } else if (member.pingMs && member.pingMs < 10) {
      status.strengths.push(`Low network latency: ${member.pingMs}ms`);
    }

    return status;
  }

  async implementAutomaticFailoverTesting() {
    console.log('Implementing automatic failover testing and validation...');

    const failoverTest = {
      testId: require('crypto').randomUUID(),
      timestamp: new Date(),
      phases: [],
      results: {
        success: false,
        totalTimeMs: 0,
        electionTimeMs: 0,
        dataConsistencyVerified: false,
        applicationConnectivityRestored: false
      }
    };

    try {
      // Phase 1: Pre-failover health check
      console.log('Phase 1: Pre-failover health assessment...');
      const preFailoverHealth = await this.performComprehensiveHealthCheck();
      failoverTest.phases.push({
        phase: 'pre_failover_health',
        timestamp: new Date(),
        status: 'completed',
        data: preFailoverHealth
      });

      if (preFailoverHealth.replicaSetStatus.healthyMembers < this.performanceTargets.minHealthyMembers + 1) {
        throw new Error('Insufficient healthy members for safe failover testing');
      }

      // Phase 2: Insert test data for consistency verification
      console.log('Phase 2: Inserting test data for consistency verification...');
      const testCollection = this.client.db('failover_test').collection('consistency_check');
      const testDocuments = Array.from({ length: 100 }, (_, i) => ({
        _id: `failover_test_${failoverTest.testId}_${i}`,
        timestamp: new Date(),
        sequenceNumber: i,
        testData: `Failover test data ${i}`,
        checksum: require('crypto').createHash('md5').update(`test_${i}`).digest('hex')
      }));

      await testCollection.insertMany(testDocuments, { writeConcern: { w: 'majority', j: true } });
      failoverTest.phases.push({
        phase: 'test_data_insertion',
        timestamp: new Date(),
        status: 'completed',
        data: { documentsInserted: testDocuments.length }
      });

      // Phase 3: Simulate primary failure (step down primary)
      console.log('Phase 3: Simulating primary failure...');
      const startTime = Date.now();

      await this.db.runCommand({ replSetStepDown: 60, force: true });

      failoverTest.phases.push({
        phase: 'primary_step_down',
        timestamp: new Date(),
        status: 'completed',
        data: { stepDownInitiated: true }
      });

      // Phase 4: Wait for new primary election
      console.log('Phase 4: Waiting for new primary election...');
      const electionStartTime = Date.now();

      const newPrimary = await this.waitForPrimaryElection(30000); // 30 second timeout
      const electionEndTime = Date.now();

      failoverTest.results.electionTimeMs = electionEndTime - electionStartTime;

      failoverTest.phases.push({
        phase: 'primary_election',
        timestamp: new Date(),
        status: 'completed',
        data: { 
          newPrimary: newPrimary,
          electionTimeMs: failoverTest.results.electionTimeMs
        }
      });

      // Phase 5: Verify data consistency
      console.log('Phase 5: Verifying data consistency...');

      // Reconnect to new primary
      await this.client.close();
      this.client = new MongoClient(this.connectionString, {
        useNewUrlParser: true,
        useUnifiedTopology: true,
        readPreference: 'primary'
      });
      await this.client.connect();

      const verificationCollection = this.client.db('failover_test').collection('consistency_check');
      const retrievedDocs = await verificationCollection.find({
        _id: { $regex: `^failover_test_${failoverTest.testId}_` }
      }).toArray();

      const consistencyCheck = {
        expectedCount: testDocuments.length,
        retrievedCount: retrievedDocs.length,
        dataIntegrityVerified: true,
        checksumMatches: 0
      };

      // Verify checksums
      for (const doc of retrievedDocs) {
        const expectedChecksum = require('crypto').createHash('md5')
          .update(`test_${doc.sequenceNumber}`).digest('hex');
        if (doc.checksum === expectedChecksum) {
          consistencyCheck.checksumMatches++;
        }
      }

      consistencyCheck.dataIntegrityVerified = 
        consistencyCheck.expectedCount === consistencyCheck.retrievedCount &&
        consistencyCheck.checksumMatches === consistencyCheck.expectedCount;

      failoverTest.results.dataConsistencyVerified = consistencyCheck.dataIntegrityVerified;

      failoverTest.phases.push({
        phase: 'data_consistency_verification',
        timestamp: new Date(),
        status: 'completed',
        data: consistencyCheck
      });

      // Phase 6: Test application connectivity
      console.log('Phase 6: Testing application connectivity...');

      try {
        // Simulate application operations
        await verificationCollection.insertOne({
          _id: `post_failover_${failoverTest.testId}`,
          timestamp: new Date(),
          message: 'Post-failover connectivity test'
        }, { writeConcern: { w: 'majority' } });

        const postFailoverDoc = await verificationCollection.findOne({
          _id: `post_failover_${failoverTest.testId}`
        });

        failoverTest.results.applicationConnectivityRestored = postFailoverDoc !== null;

      } catch (error) {
        console.error('Application connectivity test failed:', error);
        failoverTest.results.applicationConnectivityRestored = false;
      }

      failoverTest.phases.push({
        phase: 'application_connectivity_test',
        timestamp: new Date(),
        status: failoverTest.results.applicationConnectivityRestored ? 'completed' : 'failed',
        data: { connectivityRestored: failoverTest.results.applicationConnectivityRestored }
      });

      // Phase 7: Post-failover health check
      console.log('Phase 7: Post-failover health assessment...');
      const postFailoverHealth = await this.performComprehensiveHealthCheck();
      failoverTest.phases.push({
        phase: 'post_failover_health',
        timestamp: new Date(),
        status: 'completed',
        data: postFailoverHealth
      });

      // Calculate total test time
      failoverTest.results.totalTimeMs = Date.now() - startTime;

      // Determine overall success
      failoverTest.results.success = 
        failoverTest.results.electionTimeMs <= (this.performanceTargets.maxElectionTimeSeconds * 1000) &&
        failoverTest.results.dataConsistencyVerified &&
        failoverTest.results.applicationConnectivityRestored &&
        postFailoverHealth.replicaSetStatus.healthyMembers >= this.performanceTargets.minHealthyMembers;

      // Cleanup test data
      await verificationCollection.deleteMany({
        _id: { $regex: `^(failover_test_${failoverTest.testId}_|post_failover_${failoverTest.testId})` }
      });

      console.log(`Failover test completed: ${failoverTest.results.success ? 'SUCCESS' : 'PARTIAL_SUCCESS'}`);
      console.log(`Total failover time: ${failoverTest.results.totalTimeMs}ms`);
      console.log(`Election time: ${failoverTest.results.electionTimeMs}ms`);
      console.log(`Data consistency: ${failoverTest.results.dataConsistencyVerified ? 'VERIFIED' : 'FAILED'}`);
      console.log(`Application connectivity: ${failoverTest.results.applicationConnectivityRestored ? 'RESTORED' : 'FAILED'}`);

      // Record failover test in history
      this.failoverHistory.push(failoverTest);

      return failoverTest;

    } catch (error) {
      console.error('Failover test failed:', error);
      failoverTest.phases.push({
        phase: 'error',
        timestamp: new Date(),
        status: 'failed',
        error: error.message
      });
      failoverTest.results.success = false;
      return failoverTest;
    }
  }

  async setupAdvancedReadPreferences(applications) {
    console.log('Setting up advanced read preferences for optimal performance...');

    const readPreferenceConfigurations = {
      // Real-time dashboard - prefer primary for latest data
      realtime_dashboard: {
        readPreference: 'primary',
        maxStalenessSeconds: 0,
        tags: [],
        description: 'Real-time data requires primary reads',
        useCase: 'Live dashboards, real-time analytics'
      },

      // Reporting queries - can use secondaries with some lag tolerance
      reporting_analytics: {
        readPreference: 'secondaryPreferred',
        maxStalenessSeconds: 30,
        tags: [{ region: 'us-east', workload: 'analytics' }],
        description: 'Analytics workload can tolerate slight lag',
        useCase: 'Business intelligence, historical reports'
      },

      // Geographically distributed reads
      geographic_reads: {
        readPreference: 'nearest',
        maxStalenessSeconds: 60,
        tags: [],
        description: 'Prioritize network proximity for user-facing reads',
        useCase: 'User-facing applications, content delivery'
      },

      // Heavy analytical workloads
      heavy_analytics: {
        readPreference: 'secondary',
        maxStalenessSeconds: 120,
        tags: [{ workload: 'analytics', ssd: 'true' }],
        description: 'Dedicated secondary for heavy analytical queries',
        useCase: 'Data mining, complex aggregations, ML training'
      },

      // Backup and archival operations
      backup_operations: {
        readPreference: 'secondary',
        maxStalenessSeconds: 300,
        tags: [{ backup: 'true', priority: 'low' }],
        description: 'Use dedicated backup secondary',
        useCase: 'Backup operations, data archival, compliance exports'
      }
    };

    const clientConfigurations = {};

    for (const [appName, app] of Object.entries(applications)) {
      const config = readPreferenceConfigurations[app.readPattern] || readPreferenceConfigurations.geographic_reads;

      console.log(`Configuring read preferences for ${appName}:`);
      console.log(`  Pattern: ${app.readPattern}`);
      console.log(`  Read Preference: ${config.readPreference}`);
      console.log(`  Max Staleness: ${config.maxStalenessSeconds}s`);

      clientConfigurations[appName] = {
        connectionString: this.buildConnectionString(config),
        readPreference: config.readPreference,
        readPreferenceTags: config.tags,
        maxStalenessSeconds: config.maxStalenessSeconds,

        // Additional client options for optimization
        options: {
          maxPoolSize: app.connectionPoolSize || 10,
          minPoolSize: app.minConnectionPoolSize || 2,
          maxIdleTimeMS: 30000,
          serverSelectionTimeoutMS: 5000,
          socketTimeoutMS: 45000,
          connectTimeoutMS: 10000,

          // Retry configuration
          retryWrites: true,
          retryReads: true,

          // Write concern based on application requirements
          writeConcern: app.writeConcern || { w: 'majority', j: true },

          // Read concern for consistency requirements
          readConcern: { level: app.readConcern || 'majority' }
        },

        // Monitoring configuration
        monitoring: {
          commandMonitoring: true,
          serverMonitoring: true,
          topologyMonitoring: true
        },

        description: config.description,
        useCase: config.useCase,
        optimizationTips: this.generateReadOptimizationTips(config, app)
      };
    }

    // Setup monitoring for read preference effectiveness
    await this.setupReadPreferenceMonitoring(clientConfigurations);

    console.log(`Read preference configurations created for ${Object.keys(clientConfigurations).length} applications`);

    return clientConfigurations;
  }

  async implementDisasterRecoveryProcedures(options = {}) {
    console.log('Implementing comprehensive disaster recovery procedures...');

    const {
      backupSchedule = 'daily',
      retentionPolicy = { daily: 7, weekly: 4, monthly: 6 },
      geographicDistribution = true,
      automaticFailback = false,
      rtoTarget = 300, // Recovery Time Objective in seconds
      rpoTarget = 60   // Recovery Point Objective in seconds
    } = options;

    const disasterRecoveryPlan = {
      backupStrategy: await this.implementBackupStrategy(backupSchedule, retentionPolicy),
      failoverProcedures: await this.implementFailoverProcedures(rtoTarget),
      recoveryValidation: await this.implementRecoveryValidation(),
      monitoringAndAlerting: await this.setupDisasterRecoveryMonitoring(),
      documentationAndRunbooks: await this.generateDisasterRecoveryRunbooks(),
      testingSchedule: await this.createDisasterRecoveryTestSchedule()
    };

    // Geographic distribution setup
    if (geographicDistribution) {
      disasterRecoveryPlan.geographicDistribution = await this.setupGeographicDistribution();
    }

    // Automatic failback configuration
    if (automaticFailback) {
      disasterRecoveryPlan.automaticFailback = await this.configureAutomaticFailback();
    }

    console.log('Disaster recovery procedures implemented successfully');
    return disasterRecoveryPlan;
  }

  async implementBackupStrategy(schedule, retentionPolicy) {
    console.log('Implementing comprehensive backup strategy...');

    const backupStrategy = {
      hotBackups: {
        enabled: true,
        schedule: schedule,
        method: 'mongodump_with_oplog',
        compression: true,
        encryption: true,
        storageLocation: ['local', 's3', 'gcs'],
        retentionPolicy: retentionPolicy
      },

      continuousBackup: {
        enabled: true,
        oplogTailing: true,
        changeStreams: true,
        pointInTimeRecovery: true,
        maxRecoveryWindow: '7 days'
      },

      consistencyChecks: {
        enabled: true,
        frequency: 'daily',
        validationMethods: ['checksum', 'document_count', 'index_integrity']
      },

      crossRegionReplication: {
        enabled: true,
        regions: ['us-east-1', 'us-west-2', 'eu-west-1'],
        replicationLag: '< 60 seconds'
      }
    };

    // Implement backup automation
    const backupJobs = await this.createAutomatedBackupJobs(backupStrategy);

    return {
      ...backupStrategy,
      automationJobs: backupJobs,
      estimatedRPO: this.calculateEstimatedRPO(backupStrategy),
      storageRequirements: this.calculateStorageRequirements(backupStrategy)
    };
  }

  async waitForPrimaryElection(timeoutMs = 30000) {
    console.log('Waiting for primary election...');

    const startTime = Date.now();
    const pollInterval = 1000; // Check every second

    while (Date.now() - startTime < timeoutMs) {
      try {
        const status = await this.db.runCommand({ replSetGetStatus: 1 });
        const primary = status.members.find(member => member.state === 1);

        if (primary) {
          console.log(`Primary elected: ${primary.name}`);
          return primary.name;
        }

        await new Promise(resolve => setTimeout(resolve, pollInterval));
      } catch (error) {
        // Connection might be lost during election, continue polling
        await new Promise(resolve => setTimeout(resolve, pollInterval));
      }
    }

    throw new Error(`Primary election timeout after ${timeoutMs}ms`);
  }

  generateHealthAlerts(healthReport) {
    const alerts = [];

    // Check for unhealthy members
    const unhealthyMembers = healthReport.memberHealth.filter(m => 
      ['unhealthy', 'down', 'unknown'].includes(m.status.overall)
    );

    if (unhealthyMembers.length > 0) {
      alerts.push({
        severity: 'HIGH',
        type: 'UNHEALTHY_MEMBERS',
        message: `${unhealthyMembers.length} replica set members are unhealthy`,
        members: unhealthyMembers.map(m => m.name),
        impact: 'Reduced fault tolerance and potential for data inconsistency'
      });
    }

    // Check replication lag
    const laggedMembers = Object.entries(healthReport.replicationLag)
      .filter(([, lag]) => lag > this.performanceTargets.maxReplicationLagSeconds);

    if (laggedMembers.length > 0) {
      alerts.push({
        severity: 'MEDIUM',
        type: 'REPLICATION_LAG',
        message: `${laggedMembers.length} members have excessive replication lag`,
        details: Object.fromEntries(laggedMembers),
        impact: 'Potential data loss during failover'
      });
    }

    // Check minimum healthy members threshold
    if (healthReport.replicaSetStatus.healthyMembers < this.performanceTargets.minHealthyMembers) {
      alerts.push({
        severity: 'CRITICAL',
        type: 'INSUFFICIENT_HEALTHY_MEMBERS',
        message: `Only ${healthReport.replicaSetStatus.healthyMembers} healthy members (minimum: ${this.performanceTargets.minHealthyMembers})`,
        impact: 'Risk of complete service outage if another member fails'
      });
    }

    return alerts;
  }

  generateHealthRecommendations(healthReport) {
    const recommendations = [];

    // Analyze member distribution
    const membersByState = healthReport.memberHealth.reduce((acc, member) => {
      acc[member.stateStr] = (acc[member.stateStr] || 0) + 1;
      return acc;
    }, {});

    if ((membersByState.SECONDARY || 0) < 2) {
      recommendations.push({
        priority: 'HIGH',
        category: 'REDUNDANCY',
        recommendation: 'Add additional secondary members for better fault tolerance',
        reasoning: 'Minimum of 2 secondary members recommended for high availability',
        implementation: 'Use rs.add() to add new replica set members'
      });
    }

    // Check for arbiter usage
    if (membersByState.ARBITER > 0) {
      recommendations.push({
        priority: 'MEDIUM',
        category: 'ARCHITECTURE',
        recommendation: 'Consider replacing arbiters with data-bearing members',
        reasoning: 'Data-bearing members provide better fault tolerance than arbiters',
        implementation: 'Add data-bearing member and remove arbiter when safe'
      });
    }

    // Check geographic distribution
    const regions = new Set(healthReport.memberHealth
      .map(m => m.tags?.region)
      .filter(r => r)
    );

    if (regions.size < 2) {
      recommendations.push({
        priority: 'MEDIUM',
        category: 'DISASTER_RECOVERY',
        recommendation: 'Implement geographic distribution of replica set members',
        reasoning: 'Multi-region deployment protects against datacenter-level failures',
        implementation: 'Deploy members across multiple availability zones or regions'
      });
    }

    return recommendations;
  }

  buildConnectionString(config) {
    // Build MongoDB connection string with read preference options
    const params = new URLSearchParams();

    params.append('readPreference', config.readPreference);

    if (config.maxStalenessSeconds > 0) {
      params.append('maxStalenessSeconds', config.maxStalenessSeconds.toString());
    }

    if (config.tags && config.tags.length > 0) {
      // MongoDB connection strings accept repeated readPreferenceTags parameters,
      // each a comma-separated list of key:value pairs in order of preference
      config.tags.forEach((tag) => {
        const tagSet = Object.entries(tag)
          .map(([key, value]) => `${key}:${value}`)
          .join(',');
        params.append('readPreferenceTags', tagSet);
      });
    }

    return `${this.connectionString}?${params.toString()}`;
  }

  generateReadOptimizationTips(config, app) {
    const tips = [];

    if (config.readPreference === 'secondary' || config.readPreference === 'secondaryPreferred') {
      tips.push('Consider using connection pooling to maintain connections to multiple secondaries');
      tips.push('Monitor secondary lag to ensure data freshness meets application requirements');
    }

    if (config.maxStalenessSeconds > 60) {
      tips.push('Verify that application logic can handle potentially stale data');
      tips.push('Implement application-level caching for frequently accessed but slow-changing data');
    }

    if (app.queryTypes && app.queryTypes.includes('aggregation')) {
      tips.push('Heavy aggregation workloads benefit from dedicated secondary members with optimized hardware');
      tips.push('Consider using $merge or $out stages to pre-compute results on secondaries');
    }

    return tips;
  }

  async createAutomatedBackupJobs(backupStrategy) {
    // Implementation would create actual backup automation
    // This is a simplified representation
    return {
      dailyHotBackup: {
        schedule: '0 2 * * *', // 2 AM daily
        retention: backupStrategy.hotBackups.retentionPolicy.daily,
        enabled: true
      },
      continuousOplogBackup: {
        enabled: backupStrategy.continuousBackup.enabled,
        method: 'changeStreams'
      },
      weeklyFullBackup: {
        schedule: '0 1 * * 0', // 1 AM Sunday
        retention: backupStrategy.hotBackups.retentionPolicy.weekly,
        enabled: true
      }
    };
  }

  calculateEstimatedRPO(backupStrategy) {
    if (backupStrategy.continuousBackup.enabled) {
      return '< 1 minute'; // With oplog tailing
    } else {
      return '24 hours'; // With daily backups only
    }
  }

  calculateStorageRequirements(backupStrategy) {
    // Simplified storage calculation
    return {
      daily: 'Database size × compression ratio × daily retention',
      weekly: 'Database size × compression ratio × weekly retention', 
      monthly: 'Database size × compression ratio × monthly retention',
      estimated: 'Contact administrator for detailed storage analysis'
    };
  }

  async close() {
    if (this.client) {
      await this.client.close();
    }
  }
}

// Benefits of MongoDB Replica Sets:
// - Automatic failover with intelligent primary election algorithms
// - Strong consistency with configurable write and read concerns
// - Geographic distribution support for disaster recovery
// - Built-in health monitoring and self-healing capabilities
// - Flexible read preference configuration for performance optimization
// - Comprehensive backup and point-in-time recovery options
// - Zero-downtime member addition and removal
// - Advanced replication monitoring and alerting
// - Split-brain prevention through majority-based decisions
// - SQL-compatible high availability management through QueryLeaf integration

module.exports = {
  MongoReplicaSetManager
};
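
For orientation, here is a minimal usage sketch of the manager exported above. It assumes the class exposes the performComprehensiveHealthCheck() method referenced later in this article alongside the alert and recommendation helpers shown here; the connection string and hostnames are placeholders for your own replica set.

// Hypothetical usage sketch - hosts and replica set name are placeholders
const manager = new MongoReplicaSetManager(
  'mongodb://rs-a.company.com:27017,rs-b.company.com:27017/?replicaSet=global-rs'
);

async function dailyHealthReview() {
  // performComprehensiveHealthCheck() is assumed to return the healthReport
  // structure consumed by generateHealthAlerts() and generateHealthRecommendations()
  const healthReport = await manager.performComprehensiveHealthCheck();

  const alerts = manager.generateHealthAlerts(healthReport);
  const recommendations = manager.generateHealthRecommendations(healthReport);

  console.log(`Alerts: ${alerts.length}, recommendations: ${recommendations.length}`);
  await manager.close();
}

dailyHealthReview().catch(console.error);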

Understanding MongoDB Replica Set Architecture

Advanced High Availability Patterns and Strategies

Implement sophisticated replica set configurations for production environments:

// Advanced replica set patterns for enterprise deployments
class EnterpriseReplicaSetManager extends MongoReplicaSetManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString);

    this.enterpriseConfig = {
      multiRegionDeployment: true,
      dedicatedAnalyticsNodes: true,
      priorityBasedElections: true,
      customWriteConcerns: true,
      advancedMonitoring: true,
      ...enterpriseConfig
    };

    this.deploymentTopology = new Map();
    this.performanceOptimizations = new Map();
  }

  async deployGeographicallyDistributedReplicaSet(regions) {
    console.log('Deploying geographically distributed replica set...');

    const topology = {
      regions: regions,
      memberDistribution: this.calculateOptimalMemberDistribution(regions),
      networkLatencyMatrix: await this.measureInterRegionLatency(regions),
      failoverStrategy: this.designFailoverStrategy(regions)
    };

    // Configure members with geographic awareness
    const members = [];
    let memberIndex = 0;

    for (const region of regions) {
      const regionConfig = topology.memberDistribution[region.name];

      for (let i = 0; i < regionConfig.dataMembers; i++) {
        members.push({
          _id: memberIndex++,
          host: `${region.name}-data-${i}.${region.domain}:27017`,
          priority: regionConfig.priority,
          votes: 1,
          tags: {
            region: region.name,
            datacenter: region.datacenter,
            nodeType: 'data',
            ssd: 'true',
            workload: i === 0 ? 'primary' : 'secondary'
          }
        });
      }

      // Add analytics-dedicated members
      if (regionConfig.analyticsMembers > 0) {
        for (let i = 0; i < regionConfig.analyticsMembers; i++) {
          members.push({
            _id: memberIndex++,
            host: `${region.name}-analytics-${i}.${region.domain}:27017`,
            priority: 0, // Never become primary
            votes: 1,
            tags: {
              region: region.name,
              datacenter: region.datacenter,
              nodeType: 'analytics',
              workload: 'analytics',
              ssd: 'true'
            },
            hidden: true // Hidden from application discovery
          });
        }
      }

      // Add arbiter if needed for odd number of voting members
      if (regionConfig.needsArbiter) {
        members.push({
          _id: memberIndex++,
          host: `${region.name}-arbiter.${region.domain}:27017`,
          arbiterOnly: true,
          priority: 0,
          votes: 1,
          tags: {
            region: region.name,
            datacenter: region.datacenter,
            nodeType: 'arbiter'
          }
        });
      }
    }

    // Configure advanced settings for geographic distribution
    const replicaSetConfig = {
      _id: 'global-rs',
      version: 1,
      members: members,
      settings: {
        chainingAllowed: true,
        heartbeatIntervalMillis: 2000,
        heartbeatTimeoutSecs: 10,
        electionTimeoutMillis: 10000,
        catchUpTimeoutMillis: 60000,

        // Custom write concern modes for multi-region safety.
        // Keys inside each mode are member tag names; values are the number of
        // distinct tag values that must acknowledge the write.
        getLastErrorModes: {
          // Require acknowledgment from a majority of regions
          multiRegion: { region: Math.max(1, Math.ceil(regions.length / 2)) },
          // Require acknowledgment from a majority of data centers
          multiDataCenter: { datacenter: Math.ceil(regions.length / 2) },
          // For critical operations, require every region
          allRegions: { region: regions.length }
        },

        getLastErrorDefaults: {
          w: 'multiRegion',
          j: true,
          wtimeout: 15000 // Higher timeout for geographic distribution
        }
      }
    };

    // Initialize the distributed replica set
    const initResult = await this.initializeReplicaSet(members, {
      replicaSetName: 'global-rs',
      writeConcern: { w: 'multiRegion', j: true },
      readPreference: 'primaryPreferred'
    });

    if (initResult.success) {
      // Configure regional read preferences
      await this.configureRegionalReadPreferences(regions);

      // Setup cross-region monitoring
      await this.setupCrossRegionMonitoring(regions);

      // Validate network connectivity and latency
      await this.validateCrossRegionConnectivity(regions);
    }

    return {
      topology: topology,
      replicaSetConfig: replicaSetConfig,
      initResult: initResult,
      optimizations: await this.generateGlobalOptimizations(topology)
    };
  }

  async implementZeroDowntimeMaintenance(maintenancePlan) {
    console.log('Implementing zero-downtime maintenance procedures...');

    const maintenance = {
      planId: require('crypto').randomUUID(),
      startTime: new Date(),
      phases: [],
      rollbackPlan: null,
      success: false
    };

    try {
      // Phase 1: Pre-maintenance health check
      const preMaintenanceHealth = await this.performComprehensiveHealthCheck();

      if (preMaintenanceHealth.alerts.some(alert => alert.severity === 'CRITICAL')) {
        throw new Error('Cannot perform maintenance: critical health issues detected');
      }

      maintenance.phases.push({
        phase: 'pre_maintenance_health_check',
        status: 'completed',
        timestamp: new Date(),
        data: { healthyMembers: preMaintenanceHealth.replicaSetStatus.healthyMembers }
      });

      // Phase 2: Create maintenance plan execution order
      const executionOrder = this.createMaintenanceExecutionOrder(maintenancePlan, preMaintenanceHealth);

      maintenance.phases.push({
        phase: 'execution_order_planning',
        status: 'completed',
        timestamp: new Date(),
        data: { executionOrder: executionOrder }
      });

      // Phase 3: Execute maintenance on each member
      for (const step of executionOrder) {
        console.log(`Executing maintenance step: ${step.description}`);

        const stepResult = await this.executeMaintenanceStep(step);

        maintenance.phases.push({
          phase: `maintenance_step_${step.memberId}`,
          status: stepResult.success ? 'completed' : 'failed',
          timestamp: new Date(),
          data: stepResult
        });

        if (!stepResult.success && step.critical) {
          throw new Error(`Critical maintenance step failed: ${step.description}`);
        }

        // Wait for member to rejoin and catch up
        if (stepResult.requiresRejoin) {
          await this.waitForMemberRecovery(step.memberId, 300000); // 5 minute timeout
        }

        // Validate cluster health before proceeding
        const intermediateHealth = await this.performComprehensiveHealthCheck();
        if (intermediateHealth.replicaSetStatus.healthyMembers < this.performanceTargets.minHealthyMembers) {
          throw new Error('Insufficient healthy members to continue maintenance');
        }
      }

      // Phase 4: Post-maintenance validation
      const postMaintenanceHealth = await this.performComprehensiveHealthCheck();
      const validationResult = await this.validateMaintenanceCompletion(maintenancePlan, postMaintenanceHealth);

      maintenance.phases.push({
        phase: 'post_maintenance_validation',
        status: validationResult.success ? 'completed' : 'failed',
        timestamp: new Date(),
        data: validationResult
      });

      maintenance.success = validationResult.success;
      maintenance.endTime = new Date();
      maintenance.totalDurationMs = maintenance.endTime - maintenance.startTime;

      console.log(`Zero-downtime maintenance ${maintenance.success ? 'completed successfully' : 'completed with issues'}`);
      console.log(`Total duration: ${maintenance.totalDurationMs}ms`);

      return maintenance;

    } catch (error) {
      console.error('Maintenance procedure failed:', error);

      maintenance.phases.push({
        phase: 'error',
        status: 'failed',
        timestamp: new Date(),
        error: error.message
      });

      // Attempt rollback if configured
      if (maintenance.rollbackPlan) {
        console.log('Attempting rollback...');
        const rollbackResult = await this.executeRollback(maintenance.rollbackPlan);
        maintenance.rollback = rollbackResult;
      }

      maintenance.success = false;
      maintenance.endTime = new Date();
      return maintenance;
    }
  }

  calculateOptimalMemberDistribution(regions) {
    const totalRegions = regions.length;
    const distribution = {};

    if (totalRegions === 1) {
      // Single region deployment
      distribution[regions[0].name] = {
        dataMembers: 3,
        analyticsMembers: 1,
        priority: 1,
        needsArbiter: false
      };
    } else if (totalRegions === 2) {
      // Two region deployment - need arbiter for odd voting members
      distribution[regions[0].name] = {
        dataMembers: 2,
        analyticsMembers: 1,
        priority: 1,
        needsArbiter: false
      };
      distribution[regions[1].name] = {
        dataMembers: 2,
        analyticsMembers: 1,
        priority: 0.5,
        needsArbiter: true // Add arbiter to prevent split-brain
      };
    } else if (totalRegions >= 3) {
      // Multi-region deployment with primary preference
      const primaryRegion = regions[0];
      distribution[primaryRegion.name] = {
        dataMembers: 2,
        analyticsMembers: 1,
        priority: 1,
        needsArbiter: false
      };

      regions.slice(1).forEach((region, index) => {
        distribution[region.name] = {
          dataMembers: 1,
          analyticsMembers: index === 0 ? 1 : 0, // Analytics in first secondary region
          priority: 0.5 - (index * 0.1), // Decreasing priority
          needsArbiter: false
        };
      });
    }

    return distribution;
  }

  async measureInterRegionLatency(regions) {
    console.log('Measuring inter-region network latency...');

    const latencyMatrix = {};

    for (const sourceRegion of regions) {
      latencyMatrix[sourceRegion.name] = {};

      for (const targetRegion of regions) {
        if (sourceRegion.name === targetRegion.name) {
          latencyMatrix[sourceRegion.name][targetRegion.name] = 0;
          continue;
        }

        try {
          // Simulate latency measurement (in production, use actual network tests)
          const estimatedLatency = this.estimateLatencyBetweenRegions(sourceRegion, targetRegion);
          latencyMatrix[sourceRegion.name][targetRegion.name] = estimatedLatency;

        } catch (error) {
          console.warn(`Failed to measure latency between ${sourceRegion.name} and ${targetRegion.name}:`, error.message);
          latencyMatrix[sourceRegion.name][targetRegion.name] = 999; // High value for unreachable
        }
      }
    }

    return latencyMatrix;
  }

  estimateLatencyBetweenRegions(source, target) {
    // Simplified latency estimation based on geographic distance
    const latencyMap = {
      'us-east-1_us-west-2': 70,
      'us-east-1_eu-west-1': 85,
      'us-west-2_eu-west-1': 140,
      'us-east-1_ap-southeast-1': 180,
      'us-west-2_ap-southeast-1': 120,
      'eu-west-1_ap-southeast-1': 160
    };

    const key = `${source.name}_${target.name}`;
    const reverseKey = `${target.name}_${source.name}`;

    return latencyMap[key] || latencyMap[reverseKey] || 200; // Default high latency
  }

  designFailoverStrategy(regions) {
    return {
      primaryRegionFailure: {
        strategy: 'automatic_election',
        timeoutMs: 10000,
        requiredVotes: Math.ceil((regions.length * 2 + 1) / 2) // Majority
      },

      networkPartition: {
        strategy: 'majority_partition_wins',
        description: 'Partition with majority of voting members continues operation'
      },

      crossRegionReplication: {
        strategy: 'eventual_consistency',
        maxLagSeconds: 60,
        description: 'Accept eventual consistency during network issues'
      }
    };
  }

  async waitForMemberRecovery(memberId, timeoutMs) {
    console.log(`Waiting for member ${memberId} to recover...`);

    const startTime = Date.now();
    const pollInterval = 5000; // Check every 5 seconds

    while (Date.now() - startTime < timeoutMs) {
      try {
        const status = await this.db.runCommand({ replSetGetStatus: 1 });
        const member = status.members.find(m => m._id === memberId);

        if (member && [1, 2].includes(member.state)) { // PRIMARY or SECONDARY
          console.log(`Member ${memberId} recovered successfully`);
          return true;
        }

        await new Promise(resolve => setTimeout(resolve, pollInterval));
      } catch (error) {
        console.warn(`Error checking member ${memberId} status:`, error.message);
        await new Promise(resolve => setTimeout(resolve, pollInterval));
      }
    }

    throw new Error(`Member ${memberId} recovery timeout after ${timeoutMs}ms`);
  }

  createMaintenanceExecutionOrder(maintenancePlan, healthStatus) {
    const executionOrder = [];

    // Always start with secondaries, then primary
    const secondaries = healthStatus.memberHealth
      .filter(m => m.stateStr === 'SECONDARY')
      .sort((a, b) => (b.priority || 0) - (a.priority || 0)); // Highest priority secondary first

    const primary = healthStatus.memberHealth.find(m => m.stateStr === 'PRIMARY');

    // Add secondary maintenance steps
    secondaries.forEach((member, index) => {
      executionOrder.push({
        memberId: member._id,
        memberName: member.name,
        description: `Maintenance on secondary: ${member.name}`,
        critical: false,
        requiresRejoin: maintenancePlan.requiresRestart,
        estimatedDurationMs: maintenancePlan.estimatedDurationMs || 300000,
        order: index
      });
    });

    // Add primary maintenance step (with step-down)
    if (primary) {
      executionOrder.push({
        memberId: primary._id,
        memberName: primary.name,
        description: `Maintenance on primary: ${primary.name} (with step-down)`,
        critical: true,
        requiresRejoin: maintenancePlan.requiresRestart,
        requiresStepDown: true,
        estimatedDurationMs: (maintenancePlan.estimatedDurationMs || 300000) + 30000, // Extra time for election
        order: secondaries.length
      });
    }

    return executionOrder;
  }

  async executeMaintenanceStep(step) {
    console.log(`Executing maintenance step: ${step.description}`);

    try {
      // Step down primary if required
      if (step.requiresStepDown) {
        console.log(`Stepping down primary: ${step.memberName}`);
        await this.db.runCommand({ 
          replSetStepDown: Math.ceil(step.estimatedDurationMs / 1000) + 60, // Add buffer
          force: false 
        });

        // Wait for new primary election
        await this.waitForPrimaryElection(30000);
      }

      // Simulate maintenance operation (replace with actual maintenance logic)
      console.log(`Performing maintenance on ${step.memberName}...`);
      await new Promise(resolve => setTimeout(resolve, 5000)); // Simulate maintenance work

      return {
        success: true,
        memberId: step.memberId,
        memberName: step.memberName,
        requiresRejoin: step.requiresRejoin,
        completionTime: new Date()
      };

    } catch (error) {
      console.error(`Maintenance step failed for ${step.memberName}:`, error);
      return {
        success: false,
        memberId: step.memberId,
        memberName: step.memberName,
        error: error.message,
        requiresRejoin: false
      };
    }
  }

  async validateMaintenanceCompletion(maintenancePlan, postMaintenanceHealth) {
    console.log('Validating maintenance completion...');

    const validation = {
      success: true,
      checks: [],
      issues: []
    };

    // Check that all members are healthy
    const healthyMembers = postMaintenanceHealth.memberHealth
      .filter(m => ['primary', 'healthy'].includes(m.status.overall));

    validation.checks.push({
      check: 'member_health',
      passed: healthyMembers.length >= this.performanceTargets.minHealthyMembers,
      details: `${healthyMembers.length} healthy members (minimum: ${this.performanceTargets.minHealthyMembers})`
    });

    // Check replication lag
    const maxLag = Math.max(...Object.values(postMaintenanceHealth.replicationLag));
    validation.checks.push({
      check: 'replication_lag',
      passed: maxLag <= this.performanceTargets.maxReplicationLagSeconds,
      details: `Maximum lag: ${maxLag}s (target: ${this.performanceTargets.maxReplicationLagSeconds}s)`
    });

    // Check for any alerts
    const criticalAlerts = postMaintenanceHealth.alerts
      .filter(alert => alert.severity === 'CRITICAL');

    validation.checks.push({
      check: 'critical_alerts',
      passed: criticalAlerts.length === 0,
      details: `${criticalAlerts.length} critical alerts`
    });

    // Overall success determination
    validation.success = validation.checks.every(check => check.passed);

    if (!validation.success) {
      validation.issues = validation.checks
        .filter(check => !check.passed)
        .map(check => `${check.check}: ${check.details}`);
    }

    return validation;
  }
}
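
A brief, hypothetical usage sketch of the enterprise manager above: the region objects follow the shape the class expects (name, domain, datacenter), but the hosts, domains, and enterprise options are illustrative, and the sketch assumes the parent class's initializeReplicaSet() and related helpers described earlier in this article are available.

// Hypothetical deployment sketch - regions, hosts, and domains are placeholders
const enterpriseManager = new EnterpriseReplicaSetManager(
  'mongodb://seed-host.company.com:27017',
  { multiRegionDeployment: true, dedicatedAnalyticsNodes: true }
);

const regions = [
  { name: 'us-east-1', domain: 'db.company.com', datacenter: 'dc1' },
  { name: 'us-west-2', domain: 'db.company.com', datacenter: 'dc2' },
  { name: 'eu-west-1', domain: 'db.company.com', datacenter: 'dc3' }
];

enterpriseManager.deployGeographicallyDistributedReplicaSet(regions)
  .then(result => {
    console.log('Member distribution:', result.topology.memberDistribution);
    console.log('Initialization succeeded:', result.initResult.success);
  })
  .catch(console.error);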

SQL-Style Replica Set Management with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB replica set management and monitoring:

-- QueryLeaf replica set management with SQL-familiar syntax

-- Create replica set with advanced configuration
CREATE REPLICA SET global_ecommerce_rs WITH (
  members = [
    { host = 'us-east-primary-1.company.com:27017', priority = 1.0, tags = { region = 'us-east', datacenter = 'dc1' } },
    { host = 'us-east-secondary-1.company.com:27017', priority = 0.5, tags = { region = 'us-east', datacenter = 'dc2' } },
    { host = 'us-west-secondary-1.company.com:27017', priority = 0.3, tags = { region = 'us-west', datacenter = 'dc3' } },
    { host = 'eu-west-secondary-1.company.com:27017', priority = 0.3, tags = { region = 'eu-west', datacenter = 'dc4' } },
    { host = 'analytics-secondary-1.company.com:27017', priority = 0, hidden = true, tags = { workload = 'analytics' } }
  ],

  -- Advanced replica set settings
  heartbeat_interval = '2 seconds',
  election_timeout = '10 seconds',
  catchup_timeout = '60 seconds',

  -- Custom write concerns for multi-region safety
  write_concerns = {
    multi_region = { us_east = 1, us_west = 1, eu_west = 1 },
    majority_datacenter = { datacenter = 3 },
    analytics_safe = { workload_analytics = 0, datacenter = 2 }
  },

  default_write_concern = { w = 'multi_region', j = true, wtimeout = '15 seconds' }
);

-- Monitor replica set health with comprehensive metrics
WITH replica_set_health AS (
  SELECT 
    member_name,
    member_state,
    member_state_str,
    health_status,
    uptime_seconds,
    ping_ms,

    -- Replication lag calculation
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - optime_date)) as replication_lag_seconds,

    -- Member performance assessment
    CASE member_state
      WHEN 1 THEN 'PRIMARY'
      WHEN 2 THEN 'SECONDARY'
      WHEN 7 THEN 'ARBITER'
      WHEN 8 THEN 'DOWN'
      WHEN 3 THEN 'RECOVERING'
      ELSE 'UNKNOWN'
    END as role,

    -- Health grade assignment
    CASE 
      WHEN health_status = 1 AND member_state IN (1, 2) AND ping_ms < 50 THEN 'A'
      WHEN health_status = 1 AND member_state IN (1, 2) AND ping_ms < 100 THEN 'B'
      WHEN health_status = 1 AND member_state IN (1, 2, 7) THEN 'C'
      WHEN health_status = 1 AND member_state NOT IN (1, 2, 7) THEN 'D'
      ELSE 'F'
    END as health_grade,

    -- Network performance indicators
    CASE
      WHEN ping_ms IS NULL THEN 'UNREACHABLE'
      WHEN ping_ms < 10 THEN 'EXCELLENT'
      WHEN ping_ms < 50 THEN 'GOOD'
      WHEN ping_ms < 100 THEN 'ACCEPTABLE'
      WHEN ping_ms < 250 THEN 'POOR'
      ELSE 'CRITICAL'
    END as network_performance,

    -- Extract member tags for analysis
    member_tags.region as member_region,
    member_tags.datacenter as member_datacenter,
    member_tags.workload as member_workload,
    sync_source_host

  FROM rs_status()  -- QueryLeaf function to get replica set status
),

replication_analysis AS (
  SELECT 
    member_region,
    member_datacenter,
    role,

    -- Regional distribution analysis
    COUNT(*) as members_in_region,
    COUNT(*) FILTER (WHERE role = 'SECONDARY') as secondaries_in_region,
    COUNT(*) FILTER (WHERE health_grade IN ('A', 'B')) as healthy_members_in_region,

    -- Performance metrics by region
    AVG(replication_lag_seconds) as avg_replication_lag,
    MAX(replication_lag_seconds) as max_replication_lag,
    AVG(ping_ms) as avg_network_latency,
    MAX(ping_ms) as max_network_latency,

    -- Health distribution
    COUNT(*) FILTER (WHERE health_grade = 'A') as grade_a_members,
    COUNT(*) FILTER (WHERE health_grade = 'B') as grade_b_members,
    COUNT(*) FILTER (WHERE health_grade IN ('D', 'F')) as problematic_members,

    -- Fault tolerance assessment
    CASE
      WHEN COUNT(*) FILTER (WHERE role IN ('PRIMARY', 'SECONDARY') AND health_grade IN ('A', 'B')) >= 2 
      THEN 'FAULT_TOLERANT'
      WHEN COUNT(*) FILTER (WHERE role IN ('PRIMARY', 'SECONDARY')) >= 2 
      THEN 'MINIMAL_REDUNDANCY'
      ELSE 'AT_RISK'
    END as fault_tolerance_status

  FROM replica_set_health
  WHERE role != 'ARBITER'  -- Exclude arbiters from data analysis
  GROUP BY member_region, member_datacenter, role
),

failover_readiness_assessment AS (
  SELECT 
    rh.member_name,
    rh.role,
    rh.health_grade,
    rh.replication_lag_seconds,
    rh.member_region,

    -- Failover readiness scoring
    CASE 
      WHEN rh.role = 'PRIMARY' THEN 'N/A - Current Primary'
      WHEN rh.role = 'SECONDARY' AND rh.health_grade IN ('A', 'B') AND rh.replication_lag_seconds < 10 THEN 'READY'
      WHEN rh.role = 'SECONDARY' AND rh.health_grade = 'C' AND rh.replication_lag_seconds < 30 THEN 'ACCEPTABLE'
      WHEN rh.role = 'SECONDARY' AND rh.replication_lag_seconds < 120 THEN 'DELAYED'
      ELSE 'NOT_READY'
    END as failover_readiness,

    -- Estimated failover time
    CASE 
      WHEN rh.role = 'SECONDARY' AND rh.health_grade IN ('A', 'B') AND rh.replication_lag_seconds < 10 
      THEN '< 15 seconds'
      WHEN rh.role = 'SECONDARY' AND rh.replication_lag_seconds < 60 
      THEN '15-45 seconds'  
      WHEN rh.role = 'SECONDARY' AND rh.replication_lag_seconds < 300 
      THEN '1-5 minutes'
      ELSE '> 5 minutes or unknown'
    END as estimated_failover_time,

    -- Regional failover preference
    ROW_NUMBER() OVER (
      PARTITION BY rh.member_region 
      ORDER BY 
        CASE rh.health_grade WHEN 'A' THEN 1 WHEN 'B' THEN 2 WHEN 'C' THEN 3 ELSE 4 END,
        rh.replication_lag_seconds,
        rh.ping_ms
    ) as regional_failover_preference

  FROM replica_set_health rh
  WHERE rh.role IN ('PRIMARY', 'SECONDARY')
)

-- Comprehensive replica set status report
SELECT 
  'REPLICA SET HEALTH SUMMARY' as report_section,

  -- Overall cluster health
  (SELECT COUNT(*) FROM replica_set_health WHERE health_grade IN ('A', 'B')) as healthy_members,
  (SELECT COUNT(*) FROM replica_set_health WHERE role IN ('PRIMARY', 'SECONDARY')) as data_bearing_members,
  (SELECT COUNT(DISTINCT member_region) FROM replica_set_health) as regions_covered,
  (SELECT COUNT(DISTINCT member_datacenter) FROM replica_set_health) as datacenters_covered,

  -- Performance indicators
  (SELECT ROUND(AVG(replication_lag_seconds)::numeric, 2) FROM replica_set_health WHERE role = 'SECONDARY') as avg_replication_lag_sec,
  (SELECT ROUND(MAX(replication_lag_seconds)::numeric, 2) FROM replica_set_health WHERE role = 'SECONDARY') as max_replication_lag_sec,
  (SELECT ROUND(AVG(ping_ms)::numeric, 1) FROM replica_set_health WHERE ping_ms IS NOT NULL) as avg_network_latency_ms,

  -- Fault tolerance assessment
  (SELECT fault_tolerance_status FROM replication_analysis LIMIT 1) as overall_fault_tolerance,

  -- Failover readiness
  (SELECT COUNT(*) FROM failover_readiness_assessment WHERE failover_readiness = 'READY') as failover_ready_secondaries,
  (SELECT member_name FROM failover_readiness_assessment WHERE regional_failover_preference = 1 AND role = 'SECONDARY' ORDER BY replication_lag_seconds LIMIT 1) as preferred_failover_candidate

UNION ALL

-- Regional distribution analysis
SELECT 
  'REGIONAL DISTRIBUTION' as report_section,

  member_region as region,
  members_in_region,
  secondaries_in_region,  
  healthy_members_in_region,
  ROUND(avg_replication_lag::numeric, 2) as avg_lag_sec,
  ROUND(avg_network_latency::numeric, 1) as avg_latency_ms,
  fault_tolerance_status,

  -- Regional health grade
  CASE 
    WHEN problematic_members = 0 AND grade_a_members >= 1 THEN 'EXCELLENT'
    WHEN problematic_members = 0 AND healthy_members_in_region >= 1 THEN 'GOOD'
    WHEN problematic_members <= 1 THEN 'ACCEPTABLE'
    ELSE 'NEEDS_ATTENTION'
  END as regional_health_grade

FROM replication_analysis
WHERE member_region IS NOT NULL

UNION ALL

-- Failover readiness details
SELECT 
  'FAILOVER READINESS' as report_section,

  member_name,
  role,
  health_grade,
  failover_readiness,
  estimated_failover_time,
  member_region,

  CASE 
    WHEN failover_readiness = 'READY' THEN 'Can handle immediate failover'
    WHEN failover_readiness = 'ACCEPTABLE' THEN 'Can handle failover with short delay'
    WHEN failover_readiness = 'DELAYED' THEN 'Requires catch-up time before failover'
    ELSE 'Not suitable for failover'
  END as failover_notes

FROM failover_readiness_assessment
ORDER BY 
  CASE failover_readiness 
    WHEN 'READY' THEN 1 
    WHEN 'ACCEPTABLE' THEN 2 
    WHEN 'DELAYED' THEN 3 
    ELSE 4 
  END,
  replication_lag_seconds;

-- Advanced read preference configuration
CREATE READ PREFERENCE CONFIGURATION application_read_preferences AS (

  -- Real-time dashboard queries - require primary for consistency
  real_time_dashboard = {
    read_preference = 'primary',
    max_staleness = '0 seconds',
    tags = {},
    description = 'Live dashboards requiring immediate consistency'
  },

  -- Business intelligence queries - can use secondaries
  business_intelligence = {
    read_preference = 'secondaryPreferred',
    max_staleness = '30 seconds', 
    tags = [{ workload = 'analytics' }, { region = 'us-east' }],
    description = 'BI queries with slight staleness tolerance'
  },

  -- Geographic user queries - prefer regional secondaries
  geographic_user_queries = {
    read_preference = 'nearest',
    max_staleness = '60 seconds',
    tags = [{ region = '${user_region}' }],
    description = 'User-facing queries optimized for geographic proximity'
  },

  -- Reporting and archival - use dedicated analytics secondary
  reporting_archival = {
    read_preference = 'secondary',
    max_staleness = '300 seconds',
    tags = [{ workload = 'analytics' }, { hidden = 'true' }],
    description = 'Heavy reporting queries isolated from primary workload'
  },

  -- Backup operations - use specific backup-designated secondary
  backup_operations = {
    read_preference = 'secondary', 
    max_staleness = '600 seconds',
    tags = [{ backup = 'true' }],
    description = 'Backup and compliance operations'
  }
);

-- Automatic failover testing and validation
CREATE FAILOVER TEST PROCEDURE comprehensive_failover_test AS (

  -- Test configuration
  test_duration = '5 minutes',
  data_consistency_validation = true,
  application_connectivity_testing = true,
  performance_impact_measurement = true,

  -- Test phases
  phases = [
    {
      phase = 'pre_test_health_check',
      description = 'Validate cluster health before testing',
      required_healthy_members = 3,
      max_replication_lag = '30 seconds'
    },

    {
      phase = 'test_data_insertion', 
      description = 'Insert test data for consistency verification',
      test_documents = 1000,
      write_concern = { w = 'majority', j = true }
    },

    {
      phase = 'primary_step_down',
      description = 'Force primary to step down',
      step_down_duration = '300 seconds',
      force_step_down = false
    },

    {
      phase = 'election_monitoring',
      description = 'Monitor primary election process', 
      max_election_time = '30 seconds',
      log_election_details = true
    },

    {
      phase = 'connectivity_validation',
      description = 'Test application connectivity to new primary',
      connection_timeout = '10 seconds',
      retry_attempts = 3
    },

    {
      phase = 'data_consistency_check',
      description = 'Verify data consistency after failover',
      verify_test_data = true,
      checksum_validation = true
    },

    {
      phase = 'performance_assessment',
      description = 'Measure failover impact on performance',
      metrics = ['election_time', 'connectivity_restore_time', 'replication_catch_up_time']
    }
  ],

  -- Success criteria
  success_criteria = {
    max_election_time = '30 seconds',
    data_consistency = 'required',
    zero_data_loss = 'required',
    application_connectivity_restore = '< 60 seconds'
  },

  -- Automated scheduling
  schedule = 'monthly',
  notification_recipients = ['dba-team@company.com', 'ops-team@company.com']
);

-- Disaster recovery configuration and procedures
CREATE DISASTER RECOVERY PLAN enterprise_dr_plan AS (

  -- Backup strategy
  backup_strategy = {
    hot_backups = {
      frequency = 'daily',
      retention = '30 days',
      compression = true,
      encryption = true,
      storage_locations = ['s3://company-mongo-backups', 'gcs://company-mongo-dr']
    },

    continuous_backup = {
      oplog_tailing = true,
      change_streams = true,
      point_in_time_recovery = true,
      max_recovery_window = '7 days'
    },

    cross_region_replication = {
      enabled = true,
      target_regions = ['us-west-2', 'eu-central-1'],
      replication_lag_target = '< 60 seconds'
    }
  },

  -- Recovery procedures
  recovery_procedures = {

    -- Single member failure
    member_failure = {
      detection_time_target = '< 30 seconds',
      automatic_response = true,
      procedures = [
        'Automatic failover via replica set election',
        'Alert operations team',
        'Provision replacement member',
        'Add replacement to replica set',
        'Monitor replication catch-up'
      ]
    },

    -- Regional failure  
    regional_failure = {
      detection_time_target = '< 2 minutes',
      automatic_response = 'partial',
      procedures = [
        'Automatic failover to available regions',
        'Redirect application traffic',
        'Scale remaining regions for increased load',
        'Provision new regional deployment', 
        'Restore full geographic distribution'
      ]
    },

    -- Complete cluster failure
    complete_failure = {
      detection_time_target = '< 5 minutes',
      automatic_response = false,
      procedures = [
        'Activate disaster recovery plan',
        'Restore from most recent backup',
        'Apply oplog entries for point-in-time recovery',
        'Provision new cluster infrastructure',
        'Validate data integrity',
        'Redirect application traffic to recovered cluster'
      ]
    }
  },

  -- RTO/RPO targets
  recovery_targets = {
    member_failure = { rto = '< 1 minute', rpo = '0 seconds' },
    regional_failure = { rto = '< 5 minutes', rpo = '< 30 seconds' },
    complete_failure = { rto = '< 2 hours', rpo = '< 15 minutes' }
  },

  -- Testing and validation
  testing_schedule = {
    failover_tests = 'monthly',
    disaster_recovery_drills = 'quarterly', 
    backup_restoration_tests = 'weekly',
    cross_region_connectivity_tests = 'daily'
  }
);

-- Real-time monitoring and alerting configuration
CREATE MONITORING CONFIGURATION replica_set_monitoring AS (

  -- Health check intervals
  health_check_interval = '10 seconds',
  performance_sampling_interval = '30 seconds',
  trend_analysis_window = '1 hour',

  -- Alert thresholds
  alert_thresholds = {

    -- Replication lag alerts
    replication_lag = {
      warning = '30 seconds',
      critical = '2 minutes',
      escalation = '5 minutes'
    },

    -- Member health alerts  
    member_health = {
      warning = 'any_member_down',
      critical = 'primary_down_or_majority_unavailable',
      escalation = 'split_brain_detected'
    },

    -- Network latency alerts
    network_latency = {
      warning = '100 ms average',
      critical = '500 ms average', 
      escalation = 'member_unreachable'
    },

    -- Election frequency alerts
    election_frequency = {
      warning = '2 elections per hour',
      critical = '5 elections per hour',
      escalation = 'continuous_election_cycling'
    }
  },

  -- Notification configuration
  notifications = {
    email = ['dba-team@company.com', 'ops-team@company.com'],
    slack = '#database-alerts',
    pagerduty = 'mongodb-replica-set-service',
    webhook = 'https://monitoring.company.com/mongodb-alerts'
  },

  -- Automated responses
  automated_responses = {
    member_down = 'log_alert_and_notify',
    high_replication_lag = 'investigate_and_notify',
    primary_election = 'log_details_and_validate_health',
    split_brain_detection = 'immediate_escalation'
  }
);

-- QueryLeaf provides comprehensive replica set management:
-- 1. SQL-familiar syntax for replica set creation and configuration
-- 2. Advanced health monitoring with comprehensive metrics and alerting
-- 3. Automated failover testing and validation procedures
-- 4. Sophisticated read preference management for performance optimization
-- 5. Comprehensive disaster recovery planning and implementation
-- 6. Real-time monitoring with customizable thresholds and notifications
-- 7. Geographic distribution management for multi-region deployments  
-- 8. Zero-downtime maintenance procedures with automatic validation
-- 9. Performance impact assessment and optimization recommendations
-- 10. Integration with MongoDB's native replica set functionality

Best Practices for Replica Set Implementation

High Availability Design Principles

Essential guidelines for robust MongoDB replica set deployments:

  1. Odd Number of Voting Members: Always maintain an odd number of voting members to prevent split-brain scenarios (see the configuration sketch after this list)
  2. Geographic Distribution: Deploy members across multiple availability zones or regions for disaster recovery
  3. Resource Planning: Size replica set members appropriately for expected workload and failover scenarios
  4. Network Optimization: Ensure low-latency, high-bandwidth connections between replica set members
  5. Monitoring Integration: Implement comprehensive monitoring with proactive alerting for health and performance
  6. Regular Testing: Conduct regular failover tests and disaster recovery drills to validate procedures
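
As a minimal sketch of the first two principles, the following configuration defines five voting members (an odd number) spread across three regions; the hostnames, priorities, and replica set name are placeholders to adapt to your topology.

// Illustrative config: 5 voting members (odd count) across 3 regions
// Hostnames and the 'prod-rs' name are placeholders
const rsConfig = {
  _id: 'prod-rs',
  members: [
    { _id: 0, host: 'us-east-a.company.com:27017', priority: 2, tags: { region: 'us-east' } },
    { _id: 1, host: 'us-east-b.company.com:27017', priority: 1, tags: { region: 'us-east' } },
    { _id: 2, host: 'us-west-a.company.com:27017', priority: 1, tags: { region: 'us-west' } },
    { _id: 3, host: 'us-west-b.company.com:27017', priority: 1, tags: { region: 'us-west' } },
    { _id: 4, host: 'eu-west-a.company.com:27017', priority: 0.5, tags: { region: 'eu-west' } }
  ]
};

// Run once against any member from an async context, e.g. with the Node.js driver:
// await client.db('admin').command({ replSetInitiate: rsConfig });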

Operational Excellence

Optimize replica set operations for production environments:

  1. Automated Deployment: Use infrastructure as code for consistent replica set deployments
  2. Configuration Management: Maintain consistent configuration across all replica set members
  3. Security Implementation: Enable authentication, authorization, and encryption for all replica communications
  4. Backup Strategy: Implement multiple backup strategies including hot backups and point-in-time recovery
  5. Performance Monitoring: Track replication lag, network latency, and resource utilization continuously (a lag-monitoring sketch follows this list)
  6. Documentation Maintenance: Keep runbooks and procedures updated with current configuration and processes
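
To illustrate the continuous lag monitoring called for in point 5, here is a small sketch that derives per-member replication lag from replSetGetStatus; the warning threshold and logging are placeholders for your own alerting hooks.

// Report per-secondary replication lag from replSetGetStatus (sketch)
const { MongoClient } = require('mongodb');

async function reportReplicationLag(uri, warnSeconds = 30) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const status = await client.db('admin').command({ replSetGetStatus: 1 });

    const primary = status.members.find(m => m.state === 1);
    if (!primary) {
      console.warn('No primary found - an election may be in progress');
      return;
    }

    for (const member of status.members.filter(m => m.state === 2)) {
      // optimeDate is a Date; subtracting yields milliseconds behind the primary
      const lagSeconds = (primary.optimeDate - member.optimeDate) / 1000;
      const level = lagSeconds > warnSeconds ? 'WARN' : 'ok';
      console.log(`[${level}] ${member.name} lag: ${lagSeconds.toFixed(1)}s`);
    }
  } finally {
    await client.close();
  }
}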

Conclusion

MongoDB's replica set architecture provides comprehensive high availability and disaster recovery capabilities that eliminate the complexity and limitations of traditional database replication systems. The sophisticated election algorithms, automatic failover mechanisms, and flexible configuration options ensure business continuity even during catastrophic failures while maintaining data consistency and application performance.

Key MongoDB Replica Set benefits include:

  • Automatic Failover: Intelligent primary election with no manual intervention required
  • Strong Consistency: Configurable write and read concerns for application-specific consistency requirements
  • Geographic Distribution: Multi-region deployment support for comprehensive disaster recovery
  • Zero Downtime Operations: Add, remove, and maintain replica set members without service interruption
  • Flexible Read Scaling: Advanced read preference configuration for optimal performance distribution
  • Comprehensive Monitoring: Built-in health monitoring with detailed metrics and alerting capabilities

Whether you're building resilient e-commerce platforms, financial applications, or global content delivery systems, MongoDB's replica sets with QueryLeaf's familiar SQL interface provide the foundation for mission-critical high availability infrastructure.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB replica set operations while providing SQL-familiar syntax for replica set creation, health monitoring, and disaster recovery procedures. Advanced high availability patterns, automated failover testing, and comprehensive monitoring are seamlessly handled through familiar SQL constructs, making sophisticated database resilience both powerful and accessible to SQL-oriented operations teams.

The combination of MongoDB's robust replica set capabilities with SQL-style operations makes it an ideal platform for applications requiring both high availability and familiar database management patterns, ensuring your applications maintain continuous operation while remaining manageable as they scale globally.

MongoDB Aggregation Framework Optimization: Advanced Performance Strategies for Complex Data Processing Pipelines

Complex data analysis and processing require sophisticated aggregation capabilities that can handle large datasets efficiently while maintaining query performance and resource optimization. The MongoDB Aggregation Framework provides a powerful pipeline-based approach to data transformation, filtering, grouping, and analysis that scales from simple queries to complex analytical workloads.

MongoDB's aggregation pipeline enables developers to build sophisticated data processing workflows using a series of stages that transform documents as they flow through the pipeline. Unlike traditional SQL aggregation approaches that can become unwieldy for complex operations, MongoDB's stage-based design provides clarity, composability, and optimization opportunities that support both real-time analytics and batch processing scenarios.
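
Before diving into the complex pipelines below, a minimal sketch shows the stage-based flow on a hypothetical orders collection: filter early, group the survivors, then sort the results. Field names and the date cutoff are illustrative only.

// Minimal illustrative pipeline on a hypothetical 'orders' collection
const pipeline = [
  // Stage 1: filter documents as early as possible
  { $match: { status: 'completed', order_date: { $gte: new Date('2024-01-01') } } },
  // Stage 2: group the remaining documents by region
  { $group: { _id: '$region', revenue: { $sum: '$total_amount' }, orders: { $sum: 1 } } },
  // Stage 3: order the grouped results by revenue
  { $sort: { revenue: -1 } }
];

// Execute inside an async context:
// const results = await db.collection('orders').aggregate(pipeline).toArray();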

The Traditional SQL Aggregation Complexity Challenge

Conventional SQL aggregation approaches often become complex and difficult to optimize for advanced data processing requirements:

-- Traditional PostgreSQL complex aggregation with performance limitations

-- Complex sales analysis requiring multiple subqueries and window functions
WITH regional_sales_base AS (
  SELECT 
    r.region_id,
    r.region_name,
    r.country,
    u.user_id,
    u.email,
    u.created_at as user_registration_date,
    o.order_id,
    o.order_date,
    o.total_amount,
    o.discount_amount,
    o.status as order_status,

    -- Complex date calculations
    EXTRACT(YEAR FROM o.order_date) as order_year,
    EXTRACT(MONTH FROM o.order_date) as order_month,
    EXTRACT(QUARTER FROM o.order_date) as order_quarter,

    -- Category analysis requiring joins
    STRING_AGG(DISTINCT p.category, ', ') as product_categories,
    COUNT(DISTINCT oi.product_id) as unique_products_ordered,
    SUM(oi.quantity) as total_items_ordered,
    AVG(oi.unit_price) as avg_item_price,

    -- Complex business logic calculations
    CASE 
      WHEN o.total_amount > 1000 THEN 'high_value'
      WHEN o.total_amount > 500 THEN 'medium_value'
      ELSE 'low_value'
    END as order_value_category,

    -- Window functions for ranking and comparisons
    ROW_NUMBER() OVER (PARTITION BY r.region_id ORDER BY o.total_amount DESC) as region_order_rank,
    PERCENT_RANK() OVER (PARTITION BY r.region_id ORDER BY o.total_amount) as region_percentile_rank,

    -- Running totals and moving averages
    SUM(o.total_amount) OVER (
      PARTITION BY r.region_id 
      ORDER BY o.order_date 
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as seven_day_rolling_total

  FROM regions r
  INNER JOIN users u ON r.region_id = u.region_id
  INNER JOIN orders o ON u.user_id = o.user_id
  INNER JOIN order_items oi ON o.order_id = oi.order_id
  INNER JOIN products p ON oi.product_id = p.product_id
  WHERE 
    o.order_date >= CURRENT_DATE - INTERVAL '2 years'
    AND o.status IN ('completed', 'shipped', 'delivered')
    AND r.country IN ('US', 'CA', 'UK', 'AU', 'DE')
    AND u.status = 'active'
  GROUP BY 
    r.region_id, r.region_name, r.country, u.user_id, u.email, u.created_at,
    o.order_id, o.order_date, o.total_amount, o.discount_amount, o.status
),

-- Nested aggregation for customer segments
customer_segments AS (
  SELECT 
    user_id,
    email,
    region_name,
    country,

    -- Customer value calculations
    COUNT(DISTINCT order_id) as total_orders,
    SUM(total_amount) as lifetime_value,
    AVG(total_amount) as avg_order_value,
    MAX(order_date) as last_order_date,
    MIN(order_date) as first_order_date,

    -- Time-based analysis
    EXTRACT(DAY FROM (MAX(order_date) - MIN(order_date))) as customer_tenure_days,
    COUNT(DISTINCT order_year) as active_years,
    COUNT(DISTINCT order_quarter) as active_quarters,

    -- Product diversity analysis
    COUNT(DISTINCT unique_products_ordered) as product_diversity,
    STRING_AGG(DISTINCT product_categories, '; ') as all_categories_purchased,

    -- Value segmentation
    CASE 
      WHEN SUM(total_amount) > 5000 AND COUNT(DISTINCT order_id) > 10 THEN 'vip'
      WHEN SUM(total_amount) > 2000 OR COUNT(DISTINCT order_id) > 15 THEN 'loyal'
      WHEN SUM(total_amount) > 500 OR COUNT(DISTINCT order_id) > 5 THEN 'regular'
      ELSE 'occasional'
    END as customer_segment,

    -- Recency analysis
    CASE 
      WHEN MAX(order_date) >= CURRENT_DATE - INTERVAL '30 days' THEN 'active'
      WHEN MAX(order_date) >= CURRENT_DATE - INTERVAL '90 days' THEN 'recent'
      WHEN MAX(order_date) >= CURRENT_DATE - INTERVAL '180 days' THEN 'dormant'
      ELSE 'inactive'
    END as recency_status

  FROM regional_sales_base
  GROUP BY user_id, email, region_name, country
),

-- Regional performance aggregation
regional_performance AS (
  SELECT 
    region_name,
    country,
    order_year,
    order_quarter,

    -- Volume metrics
    COUNT(DISTINCT user_id) as unique_customers,
    COUNT(DISTINCT order_id) as total_orders,
    SUM(total_amount) as total_revenue,
    SUM(total_items_ordered) as total_items_sold,

    -- Average metrics
    AVG(total_amount) as avg_order_value,
    AVG(avg_item_price) as avg_item_price,

    -- Growth calculations requiring complex window functions
    LAG(SUM(total_amount)) OVER (
      PARTITION BY region_name 
      ORDER BY order_year, order_quarter
    ) as previous_quarter_revenue,

    -- Calculate growth rate
    CASE 
      WHEN LAG(SUM(total_amount)) OVER (
        PARTITION BY region_name 
        ORDER BY order_year, order_quarter
      ) > 0 THEN
        ROUND(
          ((SUM(total_amount) - LAG(SUM(total_amount)) OVER (
            PARTITION BY region_name 
            ORDER BY order_year, order_quarter
          )) / LAG(SUM(total_amount)) OVER (
            PARTITION BY region_name 
            ORDER BY order_year, order_quarter
          ) * 100)::numeric, 2
        )
      ELSE NULL
    END as quarter_over_quarter_growth_pct,

    -- Market share analysis
    SUM(total_amount) / SUM(SUM(total_amount)) OVER (PARTITION BY order_year, order_quarter) * 100 as market_share_pct,

    -- Customer distribution by segment
    COUNT(*) FILTER (WHERE order_value_category = 'high_value') as high_value_orders,
    COUNT(*) FILTER (WHERE order_value_category = 'medium_value') as medium_value_orders,
    COUNT(*) FILTER (WHERE order_value_category = 'low_value') as low_value_orders

  FROM regional_sales_base
  GROUP BY region_name, country, order_year, order_quarter
),

-- Final comprehensive analysis
comprehensive_analysis AS (
  SELECT 
    rp.*,

    -- Customer segment distribution
    cs_stats.vip_customers,
    cs_stats.loyal_customers,
    cs_stats.regular_customers,
    cs_stats.occasional_customers,

    -- Recency analysis
    cs_stats.active_customers,
    cs_stats.recent_customers,
    cs_stats.dormant_customers,
    cs_stats.inactive_customers,

    -- Customer value metrics
    cs_stats.avg_customer_lifetime_value,
    cs_stats.avg_customer_tenure_days,

    -- Performance ranking
    DENSE_RANK() OVER (ORDER BY rp.total_revenue DESC) as revenue_rank,
    DENSE_RANK() OVER (ORDER BY rp.unique_customers DESC) as customer_count_rank,
    DENSE_RANK() OVER (ORDER BY rp.avg_order_value DESC) as aov_rank

  FROM regional_performance rp
  LEFT JOIN (
    SELECT 
      region_name,
      country,
      COUNT(*) FILTER (WHERE customer_segment = 'vip') as vip_customers,
      COUNT(*) FILTER (WHERE customer_segment = 'loyal') as loyal_customers,
      COUNT(*) FILTER (WHERE customer_segment = 'regular') as regular_customers,
      COUNT(*) FILTER (WHERE customer_segment = 'occasional') as occasional_customers,
      COUNT(*) FILTER (WHERE recency_status = 'active') as active_customers,
      COUNT(*) FILTER (WHERE recency_status = 'recent') as recent_customers,
      COUNT(*) FILTER (WHERE recency_status = 'dormant') as dormant_customers,
      COUNT(*) FILTER (WHERE recency_status = 'inactive') as inactive_customers,
      AVG(lifetime_value) as avg_customer_lifetime_value,
      AVG(customer_tenure_days) as avg_customer_tenure_days
    FROM customer_segments
    GROUP BY region_name, country
  ) cs_stats ON rp.region_name = cs_stats.region_name AND rp.country = cs_stats.country
)

SELECT 
  region_name,
  country,
  order_year,
  order_quarter,

  -- Core metrics
  unique_customers,
  total_orders,
  ROUND(total_revenue::numeric, 2) as total_revenue,
  ROUND(avg_order_value::numeric, 2) as avg_order_value,

  -- Growth analysis
  COALESCE(quarter_over_quarter_growth_pct, 0) as growth_rate_pct,
  ROUND(market_share_pct::numeric, 2) as market_share_pct,

  -- Customer segments
  COALESCE(vip_customers, 0) as vip_customers,
  COALESCE(loyal_customers, 0) as loyal_customers,
  COALESCE(regular_customers, 0) as regular_customers,

  -- Customer activity
  COALESCE(active_customers, 0) as active_customers,
  COALESCE(dormant_customers + inactive_customers, 0) as at_risk_customers,

  -- Performance indicators
  revenue_rank,
  customer_count_rank,
  aov_rank,

  -- Composite performance score
  CASE 
    WHEN revenue_rank <= 3 AND customer_count_rank <= 5 AND COALESCE(quarter_over_quarter_growth_pct, 0) > 10 THEN 'excellent'
    WHEN revenue_rank <= 5 AND COALESCE(quarter_over_quarter_growth_pct, 0) > 5 THEN 'good'
    WHEN revenue_rank <= 10 OR COALESCE(quarter_over_quarter_growth_pct, 0) > 0 THEN 'average'
    ELSE 'underperforming'
  END as performance_category,

  -- Strategic recommendations (output aliases cannot be reused within the same
  -- SELECT list, so the underlying expressions are repeated here)
  CASE 
    WHEN COALESCE(dormant_customers + inactive_customers, 0) > COALESCE(active_customers, 0) * 0.3 THEN 'Focus on customer retention'
    WHEN COALESCE(quarter_over_quarter_growth_pct, 0) < 0 THEN 'Investigate declining performance'
    WHEN COALESCE(vip_customers, 0) = 0 THEN 'Develop VIP customer programs'
    WHEN market_share_pct < 5 THEN 'Expand market presence'
    ELSE 'Maintain current strategies'
  END as recommended_action

FROM comprehensive_analysis
WHERE order_year >= 2023
ORDER BY 
  order_year DESC, 
  order_quarter DESC, 
  total_revenue DESC
LIMIT 50;

-- Problems with traditional SQL aggregation approaches:
-- 1. Complex nested queries that are difficult to understand and maintain
-- 2. Multiple passes through data requiring expensive joins and subqueries
-- 3. Limited optimization opportunities due to rigid query structure
-- 4. Window functions and CTEs create performance bottlenecks with large datasets
-- 5. Difficult to compose and reuse aggregation logic across different queries
-- 6. Limited support for complex data transformations and conditional logic
-- 7. Poor performance with document-oriented or semi-structured data
-- 8. Inflexible aggregation patterns that don't adapt well to changing requirements
-- 9. Complex indexing requirements that may conflict across different aggregation needs
-- 10. Limited support for hierarchical or nested aggregation patterns

MongoDB Aggregation Framework provides powerful, optimizable pipeline processing:

// MongoDB Aggregation Framework - optimized pipeline processing with advanced strategies
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce_analytics_platform');

// Advanced aggregation framework optimization and pipeline management system
class MongoAggregationOptimizer {
  constructor(db) {
    this.db = db;
    this.collections = {
      orders: db.collection('orders'),
      users: db.collection('users'),
      products: db.collection('products'),
      regions: db.collection('regions'),
      analytics: db.collection('analytics_cache')
    };

    this.pipelineCache = new Map();
    this.performanceMetrics = new Map();
    this.optimizationStrategies = {
      earlyFiltering: true,
      indexHints: true,
      stageReordering: true,
      memoryOptimization: true,
      incrementalProcessing: true
    };
  }

  async buildOptimizedSalesAnalysisPipeline(options = {}) {
    console.log('Building optimized sales analysis aggregation pipeline...');

    const {
      dateRange = { start: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000), end: new Date() },
      regions = [],
      includeCustomerSegmentation = true,
      includeProductAnalysis = true,
      includeTemporalAnalysis = true,
      optimizationLevel = 'aggressive'
    } = options;

    // Stage 1: Early filtering for maximum performance (always first)
    const matchStage = {
      $match: {
        order_date: { 
          $gte: dateRange.start, 
          $lte: dateRange.end 
        },
        status: { $in: ['completed', 'shipped', 'delivered'] },
        // NOTE: this early filter assumes a denormalized user.region field on orders;
        // if user data only arrives via the $lookup below, rely on that stage's region filter
        ...(regions.length > 0 && { 'user.region': { $in: regions } }),
        total_amount: { $gt: 0 } // Exclude zero-value orders early
      }
    };

    // Stage 2: Lookup optimizations with targeted field selection
    const userLookupStage = {
      $lookup: {
        from: 'users',
        localField: 'user_id',
        foreignField: '_id',
        as: 'user_data',
        pipeline: [ // Use pipeline to reduce data transfer
          {
            $match: { 
              status: 'active',
              ...(regions.length > 0 && { region: { $in: regions } })
            }
          },
          {
            $project: {
              _id: 1,
              email: 1,
              region: 1,
              country: 1,
              registration_date: 1,
              customer_segment: 1
            }
          }
        ]
      }
    };

    // Stage 3: Unwind and reshape data efficiently
    const unwindUserStage = { $unwind: '$user_data' };

    // Stage 4: Add computed fields for analysis
    const addFieldsStage = {
      $addFields: {
        // Date calculations optimized for indexing
        order_year: { $year: '$order_date' },
        order_month: { $month: '$order_date' },
        order_quarter: { 
          $ceil: { $divide: [{ $month: '$order_date' }, 3] }
        },
        order_day_of_week: { $dayOfWeek: '$order_date' },

        // Business logic calculations
        order_value_category: {
          $switch: {
            branches: [
              { case: { $gte: ['$total_amount', 1000] }, then: 'high_value' },
              { case: { $gte: ['$total_amount', 500] }, then: 'medium_value' }
            ],
            default: 'low_value'
          }
        },

        // Profit margin calculations
        profit_margin: {
          $multiply: [
            { $divide: [
              { $subtract: ['$total_amount', '$cost_amount'] },
              '$total_amount'
            ]},
            100
          ]
        },

        // Discount analysis
        discount_percentage: {
          $cond: {
            if: { $gt: ['$total_amount', 0] },
            then: { 
              $multiply: [
                { $divide: ['$discount_amount', { $add: ['$total_amount', '$discount_amount'] }] },
                100
              ]
            },
            else: 0
          }
        },

        // Customer tenure at time of order
        customer_tenure_days: {
          $divide: [
            { $subtract: ['$order_date', '$user_data.registration_date'] },
            86400000 // Convert milliseconds to days
          ]
        }
      }
    };

    // Stage 5: Product analysis lookup (conditional)
    const productAnalysisStages = includeProductAnalysis ? [
      {
        $lookup: {
          from: 'order_items',
          localField: '_id',
          foreignField: 'order_id',
          as: 'order_items',
          pipeline: [
            {
              $lookup: {
                from: 'products',
                localField: 'product_id',
                foreignField: '_id',
                as: 'product',
                pipeline: [
                  {
                    $project: {
                      name: 1,
                      category: 1,
                      sub_category: 1,
                      brand: 1,
                      cost_price: 1,
                      margin_percentage: 1
                    }
                  }
                ]
              }
            },
            { $unwind: '$product' },
            {
              $group: {
                _id: '$order_id',
                product_count: { $sum: 1 },
                total_quantity: { $sum: '$quantity' },
                categories: { $addToSet: '$product.category' },
                brands: { $addToSet: '$product.brand' },
                avg_item_margin: { $avg: '$product.margin_percentage' }
              }
            }
          ]
        }
      },
      { $unwind: { path: '$order_items', preserveNullAndEmptyArrays: true } }
    ] : [];

    // Stage 6: Main aggregation pipeline for comprehensive analysis
    const groupingStage = {
      $group: {
        _id: {
          region: '$user_data.region',
          country: '$user_data.country',
          year: '$order_year',
          quarter: '$order_quarter',
          ...(includeTemporalAnalysis && {
            month: '$order_month',
            day_of_week: '$order_day_of_week'
          })
        },

        // Volume metrics
        total_orders: { $sum: 1 },
        unique_customers: { $addToSet: '$user_id' },
        total_revenue: { $sum: '$total_amount' },
        total_items_sold: { $sum: { $ifNull: ['$order_items.total_quantity', 0] } },

        // Value metrics
        avg_order_value: { $avg: '$total_amount' },
        median_order_value: { $median: { input: '$total_amount', method: 'approximate' } },
        max_order_value: { $max: '$total_amount' },
        min_order_value: { $min: '$total_amount' },

        // Profitability metrics
        total_profit: { $sum: { $multiply: ['$total_amount', { $divide: ['$profit_margin', 100] }] } },
        avg_profit_margin: { $avg: '$profit_margin' },

        // Discount analysis
        total_discounts_given: { $sum: '$discount_amount' },
        avg_discount_percentage: { $avg: '$discount_percentage' },
        orders_with_discounts: { 
          $sum: { $cond: [{ $gt: ['$discount_amount', 0] }, 1, 0] }
        },

        // Customer value distribution
        high_value_orders: { 
          $sum: { $cond: [{ $eq: ['$order_value_category', 'high_value'] }, 1, 0] }
        },
        medium_value_orders: {
          $sum: { $cond: [{ $eq: ['$order_value_category', 'medium_value'] }, 1, 0] }
        },
        low_value_orders: {
          $sum: { $cond: [{ $eq: ['$order_value_category', 'low_value'] }, 1, 0] }
        },

        // Product diversity (when product analysis enabled)
        ...(includeProductAnalysis && {
          unique_categories: { $addToSet: '$order_items.categories' },
          unique_brands: { $addToSet: '$order_items.brands' },
          avg_products_per_order: { $avg: '$order_items.product_count' },
          avg_item_margin: { $avg: '$order_items.avg_item_margin' }
        }),

        // Customer tenure analysis
        avg_customer_tenure: { $avg: '$customer_tenure_days' },
        new_customer_orders: {
          $sum: { $cond: [{ $lte: ['$customer_tenure_days', 30] }, 1, 0] }
        },

        // Sample data for detailed analysis
        sample_order_dates: { $push: '$order_date' },
        sample_customer_segments: { $push: '$user_data.customer_segment' }
      }
    };

    // Stage 7: Post-processing calculations
    const postProcessingStage = {
      $addFields: {
        // Customer metrics
        unique_customer_count: { $size: '$unique_customers' },
        orders_per_customer: { 
          $divide: ['$total_orders', { $size: '$unique_customers' }]
        },

        // Revenue per customer
        revenue_per_customer: {
          $divide: ['$total_revenue', { $size: '$unique_customers' }]
        },

        // Profit margins
        profit_margin_percentage: {
          $cond: {
            if: { $gt: ['$total_revenue', 0] },
            then: { $multiply: [{ $divide: ['$total_profit', '$total_revenue'] }, 100] },
            else: 0
          }
        },

        // Discount impact
        discount_rate: {
          $cond: {
            if: { $gt: ['$total_orders', 0] },
            then: { $multiply: [{ $divide: ['$orders_with_discounts', '$total_orders'] }, 100] },
            else: 0
          }
        },

        // Order value distribution
        high_value_percentage: {
          $multiply: [{ $divide: ['$high_value_orders', '$total_orders'] }, 100]
        },

        // New vs returning customer ratio
        new_customer_percentage: {
          $multiply: [{ $divide: ['$new_customer_orders', '$total_orders'] }, 100]
        },

        // Category diversity (when product analysis enabled)
        ...(includeProductAnalysis && {
          category_diversity_score: {
            $size: { $reduce: {
              input: '$unique_categories',
              initialValue: [],
              in: { $setUnion: ['$$value', '$$this'] }
            }}
          }
        }),

        // Performance indicators (per-customer values are recomputed inline
        // because fields added earlier in this $addFields stage cannot be
        // referenced within the same stage)
        performance_score: {
          $add: [
            { $multiply: [{ $ln: { $add: ['$total_revenue', 1] } }, 0.3] },
            { $multiply: ['$avg_profit_margin', 0.2] },
            { $multiply: [{ $ln: { $add: [{ $size: '$unique_customers' }, 1] } }, 0.3] },
            { $multiply: [{ $divide: ['$total_orders', { $size: '$unique_customers' }] }, 0.2] }
          ]
        }
      }
    };

    // Stage 8: Growth analysis using window operations
    const windowAnalysisStage = {
      $setWindowFields: {
        partitionBy: { region: '$_id.region', country: '$_id.country' },
        // sortBy takes field paths with a sort direction, not expressions
        sortBy: { '_id.year': 1, '_id.quarter': 1 },
        output: {
          previous_quarter_revenue: {
            $shift: {
              output: '$total_revenue',
              by: -1
            }
          },
          revenue_trend: {
            $linearFill: '$total_revenue'
          },
          quarter_rank: {
            $rank: {}
          },
          rolling_avg_revenue: {
            $avg: '$total_revenue',
            window: {
              documents: [-3, 0] // current quarter plus the three before it
            }
          }
        }
      }
    };

    // Stage 9: Growth calculations
    const growthCalculationStage = {
      $addFields: {
        quarter_over_quarter_growth: {
          $cond: {
            if: { $and: [
              { $ne: ['$previous_quarter_revenue', null] },
              { $gt: ['$previous_quarter_revenue', 0] }
            ]},
            then: {
              $multiply: [
                { $divide: [
                  { $subtract: ['$total_revenue', '$previous_quarter_revenue'] },
                  '$previous_quarter_revenue'
                ]},
                100
              ]
            },
            else: null
          }
        },

        performance_vs_avg: {
          $multiply: [
            { $divide: [
              { $subtract: ['$total_revenue', '$rolling_avg_revenue'] },
              '$rolling_avg_revenue'
            ]},
            100
          ]
        }

        // growth_classification is derived in the final projection stage,
        // because quarter_over_quarter_growth is added in this same $addFields
        // stage and cannot be referenced here
      }
    };

    // Stage 10: Final projections and cleanup
    const finalProjectionStage = {
      $project: {
        // Location data
        region: '$_id.region',
        country: '$_id.country',
        year: '$_id.year',
        quarter: '$_id.quarter',
        ...(includeTemporalAnalysis && {
          month: '$_id.month',
          day_of_week: '$_id.day_of_week'
        }),

        // Core metrics (rounded for presentation)
        total_orders: 1,
        unique_customer_count: 1,
        total_revenue: { $round: ['$total_revenue', 2] },
        total_profit: { $round: ['$total_profit', 2] },

        // Averages and rates
        avg_order_value: { $round: ['$avg_order_value', 2] },
        median_order_value: { $round: ['$median_order_value', 2] },
        revenue_per_customer: { $round: ['$revenue_per_customer', 2] },
        orders_per_customer: { $round: ['$orders_per_customer', 2] },

        // Percentages
        profit_margin_percentage: { $round: ['$profit_margin_percentage', 2] },
        discount_rate: { $round: ['$discount_rate', 2] },
        high_value_percentage: { $round: ['$high_value_percentage', 2] },
        new_customer_percentage: { $round: ['$new_customer_percentage', 2] },

        // Growth metrics
        quarter_over_quarter_growth: { $round: ['$quarter_over_quarter_growth', 2] },
        performance_vs_avg: { $round: ['$performance_vs_avg', 2] },
        growth_classification: {
          $switch: {
            branches: [
              { case: { $gte: ['$quarter_over_quarter_growth', 20] }, then: 'high_growth' },
              { case: { $gte: ['$quarter_over_quarter_growth', 10] }, then: 'moderate_growth' },
              { case: { $gte: ['$quarter_over_quarter_growth', 0] }, then: 'stable' },
              { case: { $gte: ['$quarter_over_quarter_growth', -10] }, then: 'declining' }
            ],
            default: 'rapidly_declining'
          }
        },

        // Performance indicators
        performance_score: { $round: ['$performance_score', 2] },
        quarter_rank: 1,

        // Product analysis (conditional)
        ...(includeProductAnalysis && {
          category_diversity_score: 1,
          avg_products_per_order: { $round: ['$avg_products_per_order', 2] },
          avg_item_margin: { $round: ['$avg_item_margin', 2] }
        }),

        // Strategic indicators
        strategic_priority: {
          $switch: {
            branches: [
              { 
                case: { 
                  $and: [
                    { $gte: ['$performance_score', 15] },
                    { $gte: ['$quarter_over_quarter_growth', 10] }
                  ]
                }, 
                then: 'high_potential' 
              },
              { 
                case: { 
                  $and: [
                    { $gte: ['$total_revenue', 50000] },
                    { $gte: ['$profit_margin_percentage', 15] }
                  ]
                }, 
                then: 'cash_cow' 
              },
              { 
                case: { $lte: ['$quarter_over_quarter_growth', -10] }, 
                then: 'needs_attention' 
              }
            ],
            default: 'monitor'
          }
        }
      }
    };

    // Stage 11: Sorting for optimal presentation
    const sortStage = {
      $sort: {
        year: -1,
        quarter: -1,
        total_revenue: -1,
        performance_score: -1
      }
    };

    // Build complete optimized pipeline
    const pipeline = [
      matchStage,
      userLookupStage,
      unwindUserStage,
      addFieldsStage,
      ...productAnalysisStages,
      groupingStage,
      postProcessingStage,
      windowAnalysisStage,
      growthCalculationStage,
      finalProjectionStage,
      sortStage
    ];

    // Add performance optimization hints based on level
    const optimizedPipeline = await this.applyOptimizationStrategies(pipeline, optimizationLevel);

    console.log(`Optimized aggregation pipeline built with ${optimizedPipeline.length} stages`);
    return optimizedPipeline;
  }

  async applyOptimizationStrategies(pipeline, optimizationLevel = 'standard') {
    console.log(`Applying ${optimizationLevel} optimization strategies...`);

    let optimizedPipeline = [...pipeline];

    if (this.optimizationStrategies.earlyFiltering) {
      // Ensure filtering stages are as early as possible
      optimizedPipeline = this.moveFilteringStagesEarly(optimizedPipeline);
    }

    if (this.optimizationStrategies.indexHints) {
      // Add index hints for better query planning
      optimizedPipeline = this.addIndexHints(optimizedPipeline);
    }

    if (this.optimizationStrategies.stageReordering && optimizationLevel === 'aggressive') {
      // Reorder stages for optimal performance
      optimizedPipeline = this.reorderPipelineStages(optimizedPipeline);
    }

    if (this.optimizationStrategies.memoryOptimization) {
      // Add memory usage optimizations
      optimizedPipeline = this.optimizeMemoryUsage(optimizedPipeline);
    }

    return optimizedPipeline;
  }

  moveFilteringStagesEarly(pipeline) {
    const filterStages = [];
    const otherStages = [];

    pipeline.forEach(stage => {
      if (stage.$match) {
        filterStages.push(stage);
      } else {
        otherStages.push(stage);
      }
    });

    return [...filterStages, ...otherStages];
  }

  addIndexHints(pipeline) {
    // Index hints are not pipeline stages ($indexStats would replace the
    // order documents entirely); they are passed to aggregate() via the
    // `hint` option, so record a suggested hint for the executor instead
    const firstStage = pipeline[0];

    if (firstStage && firstStage.$match) {
      this.suggestedHint = { order_date: -1, status: 1, user_id: 1 };
    }

    return pipeline;
  }

  optimizeMemoryUsage(pipeline) {
    // allowDiskUse is an aggregate() option, not a per-stage field (a stage
    // document must contain exactly one operator), so flag it for execution
    this.requiresDiskUse = pipeline.some(stage => stage.$group || stage.$sort);
    return pipeline;
  }

  async executeOptimizedAggregation(pipeline, options = {}) {
    console.log('Executing optimized aggregation pipeline...');

    const {
      collection = 'orders',
      explain = false,
      allowDiskUse = true,
      maxTimeMS = 300000, // 5 minutes
      batchSize = 1000
    } = options;

    const targetCollection = this.collections[collection];
    const startTime = Date.now();

    try {
      if (explain) {
        // Return execution plan for analysis
        const explainResult = await targetCollection.aggregate(pipeline).explain('executionStats');
        return {
          success: true,
          explain: explainResult,
          executionTimeMs: Date.now() - startTime
        };
      }

      // Execute aggregation with options
      const cursor = targetCollection.aggregate(pipeline, {
        allowDiskUse,
        maxTimeMS,
        batchSize,
        comment: `Optimized aggregation - ${new Date().toISOString()}`
      });

      const results = await cursor.toArray();
      const executionTime = Date.now() - startTime;

      // Cache pipeline performance metrics
      const pipelineHash = this.generatePipelineHash(pipeline);
      this.performanceMetrics.set(pipelineHash, {
        executionTimeMs: executionTime,
        resultCount: results.length,
        timestamp: new Date(),
        collection: collection
      });

      console.log(`Aggregation completed in ${executionTime}ms, returned ${results.length} documents`);

      return {
        success: true,
        results: results,
        executionTimeMs: executionTime,
        resultCount: results.length,
        pipelineHash: pipelineHash
      };

    } catch (error) {
      console.error('Aggregation execution failed:', error);
      return {
        success: false,
        error: error.message,
        executionTimeMs: Date.now() - startTime
      };
    }
  }

  async buildCustomerSegmentationPipeline(options = {}) {
    console.log('Building advanced customer segmentation pipeline...');

    const {
      lookbackMonths = 12,
      includeProductAffinity = true,
      includeGeographicAnalysis = true,
      segmentationModel = 'rfm' // recency, frequency, monetary
    } = options;

    const lookbackDate = new Date();
    lookbackDate.setMonth(lookbackDate.getMonth() - lookbackMonths);

    const pipeline = [
      // Stage 1: Filter active users and recent data
      {
        $match: {
          status: 'active',
          created_at: { $lte: new Date() },
          deleted_at: { $exists: false }
        }
      },

      // Stage 2: Join with order data
      {
        $lookup: {
          from: 'orders',
          localField: '_id',
          foreignField: 'user_id',
          as: 'orders',
          pipeline: [
            {
              $match: {
                order_date: { $gte: lookbackDate },
                status: { $in: ['completed', 'shipped', 'delivered'] },
                total_amount: { $gt: 0 }
              }
            },
            {
              $project: {
                order_date: 1,
                total_amount: 1,
                discount_amount: 1,
                items: 1,
                product_categories: 1
              }
            }
          ]
        }
      },

      // Stage 3: Calculate RFM metrics
      {
        $addFields: {
          // Recency: Days since last purchase
          recency_days: {
            $cond: {
              if: { $gt: [{ $size: '$orders' }, 0] },
              then: {
                $divide: [
                  { $subtract: [
                    new Date(),
                    { $max: '$orders.order_date' }
                  ]},
                  86400000 // Convert to days
                ]
              },
              else: 9999 // Very high number for users with no orders
            }
          },

          // Frequency: Number of orders
          frequency: { $size: '$orders' },

          // Monetary: Total amount spent
          monetary_value: { $sum: '$orders.total_amount' },

          // Additional metrics
          avg_order_value: { $avg: '$orders.total_amount' },
          total_discount_used: { $sum: '$orders.discount_amount' },
          order_date_range: {
            $cond: {
              if: { $gt: [{ $size: '$orders' }, 0] },
              then: {
                $divide: [
                  { $subtract: [
                    { $max: '$orders.order_date' },
                    { $min: '$orders.order_date' }
                  ]},
                  86400000
                ]
              },
              else: 0
            }
          }
        }
      },

      // Stage 4: Product affinity analysis (conditional)
      ...(includeProductAffinity ? [
        {
          $addFields: {
            product_categories: {
              $reduce: {
                input: '$orders.product_categories',
                initialValue: [],
                in: { $setUnion: ['$$value', '$$this'] }
              }
            },
            category_diversity: {
              $size: {
                $reduce: {
                  input: '$orders.product_categories',
                  initialValue: [],
                  in: { $setUnion: ['$$value', '$$this'] }
                }
              }
            }
          }
        }
      ] : []),

      // Stage 5: Calculate percentiles for RFM scoring
      // $percentRank takes no arguments; it ranks each document by the
      // sortBy field within the window
      {
        $setWindowFields: {
          sortBy: { recency_days: 1 },
          output: {
            recency_percentile: { $percentRank: {} }
          }
        }
      },
      {
        $setWindowFields: {
          sortBy: { frequency: 1 },
          output: {
            frequency_percentile: { $percentRank: {} }
          }
        }
      },
      {
        $setWindowFields: {
          sortBy: { monetary_value: 1 },
          output: {
            monetary_percentile: { $percentRank: {} }
          }
        }
      },

      // Stage 6: Calculate RFM scores
      {
        $addFields: {
          // Invert recency score (lower days = higher score)
          recency_score: {
            $ceil: { $multiply: [{ $subtract: [1, '$recency_percentile'] }, 5] }
          },
          frequency_score: {
            $ceil: { $multiply: ['$frequency_percentile', 5] }
          },
          monetary_score: {
            $ceil: { $multiply: ['$monetary_percentile', 5] }
          }
        }
      },

      // Stage 7: Generate customer segments
      {
        $addFields: {
          rfm_score: {
            $concat: [
              { $toString: '$recency_score' },
              { $toString: '$frequency_score' },
              { $toString: '$monetary_score' }
            ]
          },

          // Comprehensive customer segment classification
          customer_segment: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gte: ['$recency_score', 4] },
                      { $gte: ['$frequency_score', 4] },
                      { $gte: ['$monetary_score', 4] }
                    ]
                  },
                  then: 'champions'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$recency_score', 3] },
                      { $gte: ['$frequency_score', 3] },
                      { $gte: ['$monetary_score', 4] }
                    ]
                  },
                  then: 'loyal_customers'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$recency_score', 4] },
                      { $lte: ['$frequency_score', 2] },
                      { $gte: ['$monetary_score', 3] }
                    ]
                  },
                  then: 'potential_loyalists'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$recency_score', 4] },
                      { $lte: ['$frequency_score', 1] },
                      { $lte: ['$monetary_score', 2] }
                    ]
                  },
                  then: 'new_customers'
                },
                {
                  case: {
                    $and: [
                      { $lte: ['$recency_score', 2] },
                      { $gte: ['$frequency_score', 3] },
                      { $gte: ['$monetary_score', 3] }
                    ]
                  },
                  then: 'at_risk'
                },
                {
                  case: {
                    $and: [
                      { $lte: ['$recency_score', 2] },
                      { $lte: ['$frequency_score', 2] },
                      { $gte: ['$monetary_score', 3] }
                    ]
                  },
                  then: 'cannot_lose_them'
                },
                {
                  case: {
                    $and: [
                      { $lte: ['$recency_score', 2] },
                      { $lte: ['$frequency_score', 2] },
                      { $lte: ['$monetary_score', 2] }
                    ]
                  },
                  then: 'hibernating'
                }
              ],
              default: 'promising'
            }
          },

          // Customer lifetime value prediction
          predicted_clv: {
            $multiply: [
              '$avg_order_value',
              '$frequency',
              { $divide: ['$order_date_range', 365] }, // Annualized frequency
              { $subtract: [5, { $divide: ['$recency_days', 73] }] } // Recency factor
            ]
          },

          // Churn risk score
          churn_risk_score: {
            $cond: {
              if: { $gt: ['$recency_days', 90] },
              then: {
                $add: [
                  { $multiply: ['$recency_days', 0.01] },
                  { $multiply: [{ $subtract: [5, '$frequency_score'] }, 0.2] }
                ]
              },
              else: 0.1
            }
          }
        }
      },

      // Stage 8: Final projection with insights
      {
        $project: {
          _id: 1,
          email: 1,
          region: 1,
          country: 1,
          registration_date: 1,

          // RFM metrics
          recency_days: { $round: ['$recency_days', 0] },
          frequency: 1,
          monetary_value: { $round: ['$monetary_value', 2] },
          avg_order_value: { $round: ['$avg_order_value', 2] },

          // RFM scores
          recency_score: 1,
          frequency_score: 1,
          monetary_score: 1,
          rfm_score: 1,

          // Segmentation
          customer_segment: 1,
          predicted_clv: { $round: ['$predicted_clv', 2] },
          churn_risk_score: { $round: ['$churn_risk_score', 2] },

          // Additional insights
          ...(includeProductAffinity && {
            category_diversity: 1,
            preferred_categories: '$product_categories'
          }),

          // Actionable recommendations
          recommended_action: {
            $switch: {
              branches: [
                { case: { $eq: ['$customer_segment', 'champions'] }, then: 'Reward and upsell' },
                { case: { $eq: ['$customer_segment', 'loyal_customers'] }, then: 'Maintain engagement' },
                { case: { $eq: ['$customer_segment', 'potential_loyalists'] }, then: 'Increase frequency' },
                { case: { $eq: ['$customer_segment', 'new_customers'] }, then: 'Onboarding focus' },
                { case: { $eq: ['$customer_segment', 'at_risk'] }, then: 'Re-engagement campaign' },
                { case: { $eq: ['$customer_segment', 'cannot_lose_them'] }, then: 'Win-back strategy' },
                { case: { $eq: ['$customer_segment', 'hibernating'] }, then: 'Reactivation offer' }
              ],
              default: 'General nurturing'
            }
          }
        }
      },

      // Stage 9: Sort by value for prioritization
      {
        $sort: {
          customer_segment: 1,
          predicted_clv: -1,
          monetary_value: -1
        }
      }
    ];

    console.log('Customer segmentation pipeline built successfully');
    return pipeline;
  }

  async performPipelineBenchmarking(pipelines, options = {}) {
    console.log('Performing comprehensive pipeline benchmarking...');

    const {
      iterations = 3,
      includeExplainPlans = true,
      warmupRuns = 1
    } = options;

    const benchmarkResults = [];

    for (const [pipelineName, pipeline] of Object.entries(pipelines)) {
      console.log(`Benchmarking pipeline: ${pipelineName}`);

      const pipelineResults = {
        name: pipelineName,
        stages: pipeline.length,
        iterations: [],
        avgExecutionTime: 0,
        minExecutionTime: Infinity,
        maxExecutionTime: 0,
        explainPlan: null
      };

      // Warmup runs
      for (let w = 0; w < warmupRuns; w++) {
        await this.executeOptimizedAggregation(pipeline, { collection: 'orders' });
      }

      // Benchmark iterations
      for (let i = 0; i < iterations; i++) {
        const result = await this.executeOptimizedAggregation(pipeline, { 
          collection: 'orders',
          explain: i === 0 && includeExplainPlans
        });

        if (result.success) {
          if (result.explain) {
            pipelineResults.explainPlan = result.explain;
          }

          if (result.executionTimeMs) {
            pipelineResults.iterations.push(result.executionTimeMs);
            pipelineResults.minExecutionTime = Math.min(pipelineResults.minExecutionTime, result.executionTimeMs);
            pipelineResults.maxExecutionTime = Math.max(pipelineResults.maxExecutionTime, result.executionTimeMs);
          }
        }
      }

      // Calculate averages
      if (pipelineResults.iterations.length > 0) {
        pipelineResults.avgExecutionTime = pipelineResults.iterations.reduce((sum, time) => sum + time, 0) / pipelineResults.iterations.length;
      }

      benchmarkResults.push(pipelineResults);
    }

    // Sort by performance
    benchmarkResults.sort((a, b) => a.avgExecutionTime - b.avgExecutionTime);

    console.log('Pipeline benchmarking completed');
    return benchmarkResults;
  }

  generatePipelineHash(pipeline) {
    // Hash the full pipeline definition; a replacer array would strip the
    // nested stage keys, so stringify the pipeline as-is
    const pipelineString = JSON.stringify(pipeline);
    return require('crypto').createHash('md5').update(pipelineString).digest('hex');
  }

  async createOptimalIndexes() {
    console.log('Creating optimal indexes for aggregation performance...');

    const orders = this.collections.orders;
    const users = this.collections.users;

    try {
      // Compound indexes for common aggregation patterns
      await orders.createIndex({ 
        order_date: -1, 
        status: 1, 
        user_id: 1 
      }, { background: true });

      await orders.createIndex({ 
        user_id: 1, 
        order_date: -1, 
        total_amount: -1 
      }, { background: true });

      await orders.createIndex({ 
        status: 1, 
        order_date: -1 
      }, { background: true });

      await users.createIndex({ 
        status: 1, 
        region: 1, 
        created_at: -1 
      }, { background: true });

      console.log('Optimal indexes created successfully');
    } catch (error) {
      console.warn('Index creation warning:', error.message);
    }
  }
}

// Benefits of MongoDB Aggregation Framework Optimization:
// - Pipeline-based design enables clear, composable data transformations
// - Automatic query optimization and index utilization across pipeline stages  
// - Memory and performance optimizations with allowDiskUse and stage reordering
// - Advanced window functions and statistical operations for complex analysis
// - Flexible stage composition that adapts to changing analytical requirements
// - Integration with MongoDB's distributed architecture for horizontal scaling
// - Real-time and batch processing capabilities with consistent optimization patterns
// - Rich data transformation functions supporting nested documents and arrays
// - Performance monitoring and explain plan analysis for continuous optimization
// - SQL-compatible aggregation patterns through QueryLeaf integration

module.exports = {
  MongoAggregationOptimizer
};

Understanding MongoDB Aggregation Framework Architecture

Advanced Pipeline Optimization Strategies and Performance Tuning

Implement sophisticated aggregation optimization patterns for production-scale analytics:

// Advanced aggregation optimization patterns and performance monitoring
class ProductionAggregationManager {
  constructor(db) {
    this.db = db;
    this.pipelineLibrary = new Map();
    this.performanceBaselines = new Map();
    this.optimizationRules = [
      'early_filtering',
      'index_utilization', 
      'memory_optimization',
      'stage_reordering',
      'parallel_processing'
    ];
  }

  async buildRealtimeAnalyticsPipeline(analyticsConfig) {
    console.log('Building real-time analytics aggregation pipeline...');

    const {
      timeWindow = '1h',
      updateInterval = '5m',
      includeTrends = true,
      includeAnomalyDetection = true,
      alertThresholds = {}
    } = analyticsConfig;

    // Real-time metrics pipeline with change stream integration
    const realtimePipeline = [
      {
        $match: {
          operationType: { $in: ['insert', 'update'] },
          'fullDocument.order_date': {
            $gte: new Date(Date.now() - this.parseTimeWindow(timeWindow))
          },
          'fullDocument.status': { $in: ['completed', 'shipped', 'delivered'] }
        }
      },

      {
        $replaceRoot: {
          newRoot: '$fullDocument'
        }
      },

      {
        $group: {
          _id: {
            $dateTrunc: {
              date: '$order_date',
              unit: 'minute',
              binSize: parseInt(updateInterval)
            }
          },

          // Real-time metrics
          order_count: { $sum: 1 },
          revenue: { $sum: '$total_amount' },
          avg_order_value: { $avg: '$total_amount' },
          unique_customers: { $addToSet: '$user_id' },

          // Geographic distribution
          regions: { $addToSet: '$region' },
          countries: { $addToSet: '$country' },

          // Product performance
          product_categories: { $push: '$product_categories' },

          // Anomaly detection data points
          revenue_samples: { $push: '$total_amount' },
          order_timestamps: { $push: '$order_date' }
        }
      },

      {
        $addFields: {
          time_bucket: '$_id',
          unique_customer_count: { $size: '$unique_customers' },
          region_diversity: { $size: '$regions' },

          // Statistical measures for anomaly detection
          revenue_std: { $stdDevPop: '$revenue_samples' },
          revenue_median: { $median: { input: '$revenue_samples', method: 'approximate' } },

          // Performance indicators
          orders_per_minute: { 
            $divide: ['$order_count', parseInt(updateInterval)]
          },
          revenue_per_minute: {
            $divide: ['$revenue', parseInt(updateInterval)]
          }
        }
      },

      // Trend analysis using window operations
      {
        $setWindowFields: {
          sortBy: { time_bucket: 1 },
          output: {
            revenue_trend: {
              $linearFill: '$revenue'
            },
            moving_avg_revenue: {
              $avg: '$revenue',
              window: {
                documents: [-6, 0] // 7-period moving average
              }
            },
            // Window operators cannot be nested inside other expressions,
            // so capture the previous bucket's revenue here and derive the
            // change in the following stage
            previous_revenue: {
              $shift: {
                output: '$revenue',
                by: -1
              }
            }
          }
        }
      },

      {
        $addFields: {
          revenue_change: {
            $subtract: ['$revenue', { $ifNull: ['$previous_revenue', '$revenue'] }]
          }
        }
      },

      // Anomaly detection
      ...(includeAnomalyDetection ? [
        {
          $addFields: {
            anomaly_score: {
              $abs: {
                $divide: [
                  { $subtract: ['$revenue', '$moving_avg_revenue'] },
                  { $add: ['$revenue_std', 1] }
                ]
              }
            },

            is_anomaly: {
              $gt: [
                {
                  $abs: {
                    $divide: [
                      { $subtract: ['$revenue', '$moving_avg_revenue'] },
                      { $add: ['$revenue_std', 1] }
                    ]
                  }
                },
                2 // 2 standard deviations
              ]
            },

            performance_alert: {
              $cond: {
                if: {
                  $or: [
                    { $lt: ['$revenue', alertThresholds.minRevenue || 0] },
                    { $gt: ['$orders_per_minute', alertThresholds.maxOrderRate || 1000] },
                    { $lt: ['$avg_order_value', alertThresholds.minAOV || 0] }
                  ]
                },
                then: true,
                else: false
              }
            }
          }
        }
      ] : []),

      {
        $project: {
          time_bucket: 1,
          order_count: 1,
          revenue: { $round: ['$revenue', 2] },
          avg_order_value: { $round: ['$avg_order_value', 2] },
          unique_customer_count: 1,
          region_diversity: 1,

          // Trend indicators
          ...(includeTrends && {
            revenue_change: { $round: ['$revenue_change', 2] },
            moving_avg_revenue: { $round: ['$moving_avg_revenue', 2] },
            trend_direction: {
              $switch: {
                branches: [
                  { case: { $gt: ['$revenue_change', 0] }, then: 'up' },
                  { case: { $lt: ['$revenue_change', 0] }, then: 'down' }
                ],
                default: 'stable'
              }
            }
          }),

          // Alert information
          ...(includeAnomalyDetection && {
            anomaly_score: { $round: ['$anomaly_score', 3] },
            is_anomaly: 1,
            performance_alert: 1
          }),

          // Timestamp for real-time tracking
          computed_at: new Date()
        }
      },

      {
        $sort: { time_bucket: -1 }
      },

      {
        $limit: 100 // Keep recent data points
      }
    ];

    return realtimePipeline;
  }

  async optimizePipelineForScale(pipeline, scaleRequirements) {
    console.log('Optimizing pipeline for scale requirements...');

    const {
      expectedDocuments = 1000000,
      maxExecutionTime = 60000,
      memoryLimit = '100M',
      parallelization = true
    } = scaleRequirements;

    let optimizedPipeline = [...pipeline];

    // 1. Add early filtering based on data volume
    if (expectedDocuments > 100000) {
      optimizedPipeline = this.addEarlyFiltering(optimizedPipeline);
    }

    // 2. Optimize grouping operations for large datasets
    optimizedPipeline = this.optimizeGroupingStages(optimizedPipeline, expectedDocuments);

    // 3. Add memory management directives
    optimizedPipeline = this.addMemoryManagement(optimizedPipeline, memoryLimit);

    // 4. Enable parallelization where possible
    if (parallelization) {
      optimizedPipeline = this.enableParallelProcessing(optimizedPipeline);
    }

    // 5. Add performance monitoring
    optimizedPipeline = this.addPerformanceMonitoring(optimizedPipeline);

    return optimizedPipeline;
  }

  addEarlyFiltering(pipeline) {
    // Move all $match stages to the beginning
    const matchStages = pipeline.filter(stage => stage.$match);
    const otherStages = pipeline.filter(stage => !stage.$match);

    return [...matchStages, ...otherStages];
  }

  optimizeGroupingStages(pipeline, expectedDocuments) {
    if (expectedDocuments <= 500000) {
      return pipeline;
    }

    // For very large inputs, prefer approximate $median accumulators; disk
    // spilling is requested via the aggregate() options, not per stage
    this.requiresDiskUse = true;

    return pipeline.map(stage => {
      if (!stage.$group) {
        return stage;
      }

      const groupSpec = {};
      for (const [field, accumulator] of Object.entries(stage.$group)) {
        if (accumulator && accumulator.$median) {
          groupSpec[field] = {
            $median: { ...accumulator.$median, method: 'approximate' }
          };
        } else {
          groupSpec[field] = accumulator;
        }
      }
      return { $group: groupSpec };
    });
  }

  addMemoryManagement(pipeline, memoryLimit) {
    // allowDiskUse and memory limits are aggregate() options rather than
    // stage fields; record them so the executor can apply them at run time
    this.memoryBudgetBytes = this.parseMemoryLimit(memoryLimit);
    if (pipeline.some(stage => stage.$sort || stage.$group || stage.$bucket)) {
      this.requiresDiskUse = true;
    }
    return pipeline;
  }

  parseMemoryLimit(limit) {
    const units = { M: 1024 * 1024, G: 1024 * 1024 * 1024 };
    const match = limit.match(/(\d+)([MG])/);
    return match ? parseInt(match[1]) * units[match[2]] : 100 * 1024 * 1024;
  }

  parseTimeWindow(timeWindow) {
    const units = { m: 60000, h: 3600000, d: 86400000 };
    const match = timeWindow.match(/(\d+)([mhd])/);
    return match ? parseInt(match[1]) * units[match[2]] : 3600000;
  }

  enableParallelProcessing(pipeline) {
    // MongoDB parallelizes eligible stages automatically (particularly on
    // sharded clusters), so there is nothing to rewrite here
    return pipeline;
  }

  addPerformanceMonitoring(pipeline) {
    // Record when the pipeline was prepared so executions can be correlated
    // with profiler output and slow query logs
    this.lastPipelinePreparedAt = new Date();
    return pipeline;
  }
}

SQL-Style Aggregation Optimization with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB aggregation optimization and complex analytics:

-- QueryLeaf aggregation framework optimization with SQL-familiar patterns

-- Advanced sales analysis with optimized aggregation pipeline
WITH regional_sales_optimized AS (
  SELECT 
    region,
    country,
    YEAR(order_date) as order_year,
    QUARTER(order_date) as order_quarter,
    MONTH(order_date) as order_month,

    -- Optimized aggregation functions
    COUNT(*) as total_orders,
    COUNT(DISTINCT user_id) as unique_customers,
    SUM(total_amount) as total_revenue,
    AVG(total_amount) as avg_order_value,
    MEDIAN_APPROX(total_amount) as median_order_value,
    STDDEV(total_amount) as revenue_stddev,

    -- Advanced statistical functions
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) as q1_order_value,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) as q3_order_value,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_amount) as p95_order_value,

    -- Product diversity metrics
    COUNT(DISTINCT UNNEST(product_categories)) as unique_categories,
    AVG(ARRAY_LENGTH(product_categories)) as avg_categories_per_order,

    -- Customer behavior analysis
    COUNT(*) FILTER (WHERE total_amount > 1000) as high_value_orders,
    COUNT(*) FILTER (WHERE discount_amount > 0) as discounted_orders,
    AVG(discount_amount) as avg_discount,

    -- Time-based patterns
    COUNT(*) FILTER (WHERE EXTRACT(DOW FROM order_date) IN (0, 6)) as weekend_orders,
    COUNT(*) FILTER (WHERE EXTRACT(HOUR FROM order_date) BETWEEN 9 AND 17) as business_hours_orders,

    -- Customer tenure analysis
    AVG(EXTRACT(DAYS FROM order_date - user_registration_date)) as avg_customer_tenure,

    -- Seasonal indicators
    CASE 
      WHEN MONTH(order_date) IN (12, 1, 2) THEN 'winter'
      WHEN MONTH(order_date) IN (3, 4, 5) THEN 'spring'
      WHEN MONTH(order_date) IN (6, 7, 8) THEN 'summer'
      ELSE 'fall'
    END as season

  FROM orders o
  INNER JOIN users u ON o.user_id = u._id
  WHERE o.order_date >= CURRENT_DATE - INTERVAL '2 years'
    AND o.status IN ('completed', 'shipped', 'delivered')
    AND o.total_amount > 0
    AND u.status = 'active'
  GROUP BY region, country, order_year, order_quarter, order_month

  -- QueryLeaf optimization hints
  USING INDEX (order_date_status_user_idx)
  WITH AGGREGATION_OPTIONS (
    allow_disk_use = true,
    max_memory_usage = '200M',
    optimization_level = 'aggressive'
  )
),

-- Window functions for trend analysis and growth calculations
growth_analysis AS (
  SELECT 
    *,

    -- Period-over-period growth calculations
    LAG(total_revenue) OVER (
      PARTITION BY region, country 
      ORDER BY order_year, order_quarter, order_month
    ) as previous_period_revenue,

    LAG(unique_customers) OVER (
      PARTITION BY region, country
      ORDER BY order_year, order_quarter, order_month  
    ) as previous_period_customers,

    -- Moving averages for trend smoothing
    AVG(total_revenue) OVER (
      PARTITION BY region, country
      ORDER BY order_year, order_quarter, order_month
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as three_period_avg_revenue,

    AVG(avg_order_value) OVER (
      PARTITION BY region, country
      ORDER BY order_year, order_quarter, order_month
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW  
    ) as six_period_avg_aov,

    -- Rank and percentile calculations
    RANK() OVER (
      PARTITION BY order_year, order_quarter
      ORDER BY total_revenue DESC
    ) as revenue_rank,

    PERCENT_RANK() OVER (
      PARTITION BY order_year, order_quarter
      ORDER BY total_revenue
    ) as revenue_percentile,

    -- Running totals and cumulative metrics
    SUM(total_revenue) OVER (
      PARTITION BY region, country, order_year
      ORDER BY order_quarter, order_month
      ROWS UNBOUNDED PRECEDING
    ) as ytd_revenue,

    -- Anomaly detection using statistical functions
    ABS(total_revenue - AVG(total_revenue) OVER (
      PARTITION BY region, country
      ORDER BY order_year, order_quarter, order_month
      ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    )) / STDDEV(total_revenue) OVER (
      PARTITION BY region, country  
      ORDER BY order_year, order_quarter, order_month
      ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) as revenue_z_score

  FROM regional_sales_optimized
),

-- Customer segmentation using advanced analytics
customer_segmentation AS (
  SELECT 
    user_id,
    region,
    country,
    registration_date,

    -- RFM analysis (Recency, Frequency, Monetary)
    EXTRACT(DAYS FROM CURRENT_DATE - MAX(order_date)) as recency_days,
    COUNT(*) as frequency,
    SUM(total_amount) as monetary_value,
    AVG(total_amount) as avg_order_value,

    -- Advanced customer metrics
    MAX(order_date) - MIN(order_date) as customer_lifespan,
    COUNT(DISTINCT EXTRACT(QUARTER FROM order_date)) as active_quarters,
    STDDEV(total_amount) as order_consistency,

    -- Product affinity analysis
    COUNT(DISTINCT UNNEST(product_categories)) as category_diversity,
    MODE() WITHIN GROUP (ORDER BY UNNEST(product_categories)) as preferred_category,

    -- Seasonal behavior patterns
    AVG(total_amount) FILTER (WHERE season = 'winter') as winter_avg_spend,
    AVG(total_amount) FILTER (WHERE season = 'summer') as summer_avg_spend,

    -- Channel preference analysis  
    COUNT(*) FILTER (WHERE channel = 'mobile') as mobile_orders,
    COUNT(*) FILTER (WHERE channel = 'web') as web_orders,
    COUNT(*) FILTER (WHERE channel = 'store') as store_orders,

    -- Time-based behavior patterns
    AVG(EXTRACT(HOUR FROM order_timestamp)) as preferred_hour,
    COUNT(*) FILTER (WHERE EXTRACT(DOW FROM order_date) IN (0, 6)) / COUNT(*)::float as weekend_preference,

    -- Discount utilization patterns
    COUNT(*) FILTER (WHERE discount_amount > 0) / COUNT(*)::float as discount_utilization_rate,
    AVG(discount_amount) FILTER (WHERE discount_amount > 0) as avg_discount_when_used

  FROM orders o
  INNER JOIN users u ON o.user_id = u._id
  WHERE o.order_date >= CURRENT_DATE - INTERVAL '1 year'
    AND o.status IN ('completed', 'shipped', 'delivered')
    AND u.status = 'active'
  GROUP BY user_id, region, country, registration_date
),

-- RFM scoring and segmentation
customer_segments_scored AS (
  SELECT 
    *,

    -- RFM quintile scoring (1-5 scale)
    NTILE(5) OVER (ORDER BY recency_days DESC) as recency_score, -- Lower recency = higher score
    NTILE(5) OVER (ORDER BY frequency ASC) as frequency_score,
    NTILE(5) OVER (ORDER BY monetary_value ASC) as monetary_score,

    -- Comprehensive customer segment classification
    CASE 
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) >= 4 
           AND NTILE(5) OVER (ORDER BY frequency ASC) >= 4 
           AND NTILE(5) OVER (ORDER BY monetary_value ASC) >= 4 THEN 'champions'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) >= 4 
           AND NTILE(5) OVER (ORDER BY frequency ASC) >= 3 
           AND NTILE(5) OVER (ORDER BY monetary_value ASC) >= 3 THEN 'loyal_customers'  
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) >= 4 
           AND NTILE(5) OVER (ORDER BY frequency ASC) <= 2 
           AND NTILE(5) OVER (ORDER BY monetary_value ASC) >= 3 THEN 'potential_loyalists'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) >= 4 
           AND NTILE(5) OVER (ORDER BY frequency ASC) <= 1 THEN 'new_customers'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) BETWEEN 2 AND 3 
           AND NTILE(5) OVER (ORDER BY frequency ASC) >= 3 THEN 'at_risk'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) <= 2 
           AND NTILE(5) OVER (ORDER BY frequency ASC) >= 3 
           AND NTILE(5) OVER (ORDER BY monetary_value ASC) >= 3 THEN 'cannot_lose_them'
      WHEN NTILE(5) OVER (ORDER BY recency_days DESC) <= 2 
           AND NTILE(5) OVER (ORDER BY frequency ASC) <= 2 THEN 'hibernating'
      ELSE 'promising'
    END as customer_segment,

    -- Customer Lifetime Value prediction
    (avg_order_value * frequency * 
     CASE WHEN customer_lifespan > 0 THEN 365.0 / EXTRACT(DAYS FROM customer_lifespan) ELSE 12 END *
     (6 - LEAST(5, recency_days / 30.0))) as predicted_clv,

    -- Churn risk assessment
    CASE 
      WHEN recency_days > 180 THEN 'high_risk'
      WHEN recency_days > 90 THEN 'medium_risk'
      WHEN recency_days > 30 THEN 'low_risk'
      ELSE 'active'
    END as churn_risk,

    -- Channel preference classification
    CASE 
      WHEN mobile_orders > web_orders AND mobile_orders > store_orders THEN 'mobile_first'
      WHEN web_orders > mobile_orders AND web_orders > store_orders THEN 'web_first' 
      WHEN store_orders > mobile_orders AND store_orders > web_orders THEN 'store_first'
      ELSE 'omnichannel'
    END as channel_preference

  FROM customer_segmentation
),

-- Comprehensive business intelligence summary
business_intelligence_summary AS (
  SELECT 
    ga.region,
    ga.country,
    ga.order_year,
    ga.order_quarter,

    -- Performance metrics with growth indicators
    ga.total_revenue,
    ga.unique_customers,
    ga.avg_order_value,

    -- Growth calculations
    CASE 
      WHEN ga.previous_period_revenue > 0 THEN
        ROUND(((ga.total_revenue - ga.previous_period_revenue) / ga.previous_period_revenue * 100)::numeric, 2)
      ELSE NULL
    END as revenue_growth_pct,

    CASE
      WHEN ga.previous_period_customers > 0 THEN
        ROUND(((ga.unique_customers - ga.previous_period_customers) / ga.previous_period_customers * 100)::numeric, 2) 
      ELSE NULL
    END as customer_growth_pct,

    -- Trend indicators
    CASE 
      WHEN ga.total_revenue > ga.three_period_avg_revenue * 1.1 THEN 'growing'
      WHEN ga.total_revenue < ga.three_period_avg_revenue * 0.9 THEN 'declining'
      ELSE 'stable'
    END as revenue_trend,

    -- Performance rankings
    ga.revenue_rank,
    ga.revenue_percentile,

    -- Anomaly detection
    CASE 
      WHEN ga.revenue_z_score > 2 THEN 'positive_anomaly'
      WHEN ga.revenue_z_score < -2 THEN 'negative_anomaly'
      ELSE 'normal'
    END as anomaly_status,

    -- Customer segment distribution
    css.champions_count,
    css.loyal_customers_count,
    css.at_risk_count,
    css.hibernating_count,

    -- Customer value metrics
    css.avg_predicted_clv,
    css.high_risk_customers,

    -- Channel distribution
    css.mobile_first_customers,
    css.web_first_customers,
    css.omnichannel_customers,

    -- Strategic recommendations
    CASE 
      WHEN ga.revenue_growth_pct < -10 AND css.at_risk_count > css.loyal_customers_count THEN 'urgent_retention_focus'
      WHEN ga.revenue_growth_pct > 20 AND ga.revenue_rank <= 5 THEN 'scale_and_expand'
      WHEN css.hibernating_count > css.champions_count THEN 'reactivation_campaign'  
      WHEN ga.avg_order_value < ga.six_period_avg_aov * 0.9 THEN 'upselling_opportunity'
      ELSE 'maintain_momentum'
    END as strategic_recommendation

  FROM growth_analysis ga
  LEFT JOIN (
    SELECT 
      region,
      country,
      COUNT(*) FILTER (WHERE customer_segment = 'champions') as champions_count,
      COUNT(*) FILTER (WHERE customer_segment = 'loyal_customers') as loyal_customers_count,
      COUNT(*) FILTER (WHERE customer_segment = 'at_risk') as at_risk_count,
      COUNT(*) FILTER (WHERE customer_segment = 'hibernating') as hibernating_count,
      AVG(predicted_clv) as avg_predicted_clv,
      COUNT(*) FILTER (WHERE churn_risk = 'high_risk') as high_risk_customers,
      COUNT(*) FILTER (WHERE channel_preference = 'mobile_first') as mobile_first_customers,
      COUNT(*) FILTER (WHERE channel_preference = 'web_first') as web_first_customers,
      COUNT(*) FILTER (WHERE channel_preference = 'omnichannel') as omnichannel_customers
    FROM customer_segments_scored
    GROUP BY region, country
  ) css ON ga.region = css.region AND ga.country = css.country

  WHERE ga.order_year >= 2023
)

SELECT 
  region,
  country,
  order_year,
  order_quarter,

  -- Core performance metrics
  total_revenue,
  unique_customers,
  ROUND(avg_order_value::numeric, 2) as avg_order_value,

  -- Growth indicators
  COALESCE(revenue_growth_pct, 0) as revenue_growth_pct,
  COALESCE(customer_growth_pct, 0) as customer_growth_pct,
  revenue_trend,

  -- Market position
  revenue_rank,
  ROUND((revenue_percentile * 100)::numeric, 1) as revenue_percentile_rank,
  anomaly_status,

  -- Customer portfolio health
  COALESCE(champions_count, 0) as champions,
  COALESCE(loyal_customers_count, 0) as loyal_customers,
  COALESCE(at_risk_count, 0) as at_risk_customers,
  COALESCE(high_risk_customers, 0) as churn_risk_customers,

  -- Channel insights
  COALESCE(mobile_first_customers, 0) as mobile_focused,
  COALESCE(omnichannel_customers, 0) as omnichannel_users,

  -- Value predictions
  ROUND(COALESCE(avg_predicted_clv, 0)::numeric, 2) as avg_customer_ltv,

  -- Strategic guidance
  strategic_recommendation,

  -- Executive summary scoring
  CASE 
    WHEN revenue_growth_pct > 15 AND revenue_rank <= 3 THEN 'excellent'
    WHEN revenue_growth_pct > 5 AND revenue_rank <= 10 THEN 'good' 
    WHEN revenue_growth_pct >= 0 OR revenue_rank <= 20 THEN 'acceptable'
    ELSE 'needs_improvement'
  END as overall_performance_grade

FROM business_intelligence_summary
ORDER BY 
  order_year DESC,
  order_quarter DESC,
  total_revenue DESC
LIMIT 100;

-- Real-time aggregation pipeline with change streams
CREATE MATERIALIZED VIEW real_time_metrics AS
SELECT 
  DATE_TRUNC('minute', order_timestamp, 5) as time_bucket, -- 5-minute buckets
  region,

  -- Real-time KPIs
  COUNT(*) as orders_per_5min,
  SUM(total_amount) as revenue_per_5min,
  COUNT(DISTINCT user_id) as unique_customers_5min,
  AVG(total_amount) as avg_order_value_5min,

  -- Velocity metrics
  COUNT(*) / 5.0 as orders_per_minute,
  SUM(total_amount) / 5.0 as revenue_per_minute,

  -- Performance alerts
  CASE 
    WHEN COUNT(*) > 1000 THEN 'high_volume_alert'
    WHEN AVG(total_amount) < 50 THEN 'low_aov_alert'
    WHEN COUNT(DISTINCT user_id) / COUNT(*)::float < 0.7 THEN 'retention_concern'
    ELSE 'normal'
  END as alert_status,

  -- Trend indicators
  LAG(SUM(total_amount)) OVER (
    PARTITION BY region 
    ORDER BY DATE_TRUNC('minute', order_timestamp, 5)
  ) as previous_bucket_revenue,

  CURRENT_TIMESTAMP as computed_at

FROM orders
WHERE order_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  AND status IN ('completed', 'shipped', 'delivered')
GROUP BY DATE_TRUNC('minute', order_timestamp, 5), region

-- QueryLeaf optimization features:
WITH AGGREGATION_SETTINGS (
  refresh_interval = '1 minute',
  allow_disk_use = true,
  max_memory_usage = '500M',
  parallel_processing = true,
  index_hints = ['order_timestamp_region_idx', 'user_status_idx'],
  change_stream_enabled = true
);

-- QueryLeaf provides comprehensive aggregation optimization:
-- 1. SQL-familiar syntax for complex MongoDB aggregation pipelines
-- 2. Automatic pipeline optimization with index hints and memory management
-- 3. Advanced window functions and statistical operations for analytics
-- 4. Real-time aggregation capabilities with change streams integration  
-- 5. Performance monitoring and explain plan analysis tools
-- 6. Materialized view support for frequently accessed aggregations
-- 7. Customer segmentation and RFM analysis with built-in algorithms
-- 8. Anomaly detection and alerting capabilities for operational intelligence
-- 9. Growth analysis and trend calculation functions
-- 10. Strategic business intelligence reporting with actionable insights

Best Practices for Aggregation Framework Optimization

Pipeline Design Strategy

Essential principles for building high-performance MongoDB aggregation pipelines (a short sketch illustrating the first two follows the list):

  1. Early Stage Filtering: Place $match stages as early as possible to reduce documents flowing through the pipeline
  2. Index Utilization: Design indexes specifically for aggregation query patterns and filter conditions
  3. Stage Ordering: Order stages to minimize memory usage and maximize index effectiveness
  4. Memory Management: Use allowDiskUse for large dataset operations and monitor memory consumption
  5. Pipeline Composition: Break complex pipelines into reusable, testable components
  6. Performance Monitoring: Implement comprehensive explain plan analysis and execution time tracking
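
A minimal sketch of the first two principles, assuming the orders collection and the order_date/status compound index used in the examples above (names are illustrative): the $match stage leads the pipeline, allowDiskUse is passed as an option, and explain() confirms that the planner chose an index scan.

// Hypothetical sketch: verify early filtering and index usage with explain()
const { MongoClient } = require('mongodb');

async function verifyPipelineUsesIndex() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  try {
    const orders = client.db('ecommerce_analytics_platform').collection('orders');

    const pipeline = [
      // Principle 1: filter first so later stages process fewer documents
      { $match: { status: 'completed', order_date: { $gte: new Date('2024-01-01') } } },
      { $group: { _id: '$user_id', revenue: { $sum: '$total_amount' } } },
      { $sort: { revenue: -1 } }
    ];

    // Principles 2 and 6: inspect the execution plan for index usage
    const plan = await orders
      .aggregate(pipeline, { allowDiskUse: true })
      .explain('executionStats');

    console.log('Index scan used:', JSON.stringify(plan).includes('IXSCAN'));
  } finally {
    await client.close();
  }
}

verifyPipelineUsesIndex().catch(console.error);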

Production Optimization Techniques

Optimize MongoDB aggregation pipelines for production-scale workloads (the sketch after this list illustrates techniques 1 and 4):

  1. Index Strategy: Create compound indexes aligned with aggregation filter and grouping patterns
  2. Memory Optimization: Balance memory usage with disk spillover for optimal performance
  3. Parallel Processing: Leverage MongoDB's parallel processing capabilities for large dataset aggregations
  4. Caching Strategies: Implement result caching and materialized views for frequently accessed aggregations
  5. Real-time Analytics: Use change streams and incremental processing for real-time analytical workloads
  6. Monitoring Integration: Deploy comprehensive performance monitoring and alerting for production pipelines
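
As one illustration of the caching and materialized-view techniques above, a scheduled $merge stage can maintain a pre-aggregated collection that dashboards query instead of re-running the full pipeline. This is a sketch only; the daily_revenue_mv collection name and field names are assumptions.

// Hypothetical materialized-view refresh using $merge (caching strategies, item 4 above)
// Re-aggregates the last 24 hours and upserts the buckets into 'daily_revenue_mv'
async function refreshDailyRevenueView(db) {
  await db.collection('orders').aggregate([
    { $match: { status: 'completed', createdAt: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) } } },
    {
      $group: {
        _id: { day: { $dateTrunc: { date: '$createdAt', unit: 'day' } }, region: '$region' },
        revenue: { $sum: '$totalAmount' },
        orders: { $sum: 1 }
      }
    },
    // Upsert each bucket so repeated refreshes stay incremental instead of rebuilding everything
    { $merge: { into: 'daily_revenue_mv', on: '_id', whenMatched: 'replace', whenNotMatched: 'insert' } }
  ], { allowDiskUse: true }).toArray(); // iterating the cursor triggers the $merge write
}

// Dashboards then read the small pre-aggregated collection instead of the raw orders:
// db.collection('daily_revenue_mv').find({ '_id.region': 'EU' }).sort({ '_id.day': -1 })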

Conclusion

MongoDB's Aggregation Framework provides a powerful, flexible foundation for complex data processing and analytics that scales from simple transformations to sophisticated analytical workloads. The pipeline-based architecture enables clear, maintainable data processing workflows with extensive optimization opportunities that support both real-time and batch processing scenarios.

Key MongoDB Aggregation Framework benefits include:

  • Pipeline Clarity: Stage-based design that promotes clear, maintainable data transformation logic
  • Performance Optimization: Sophisticated optimization engine with index utilization and memory management
  • Analytical Power: Rich statistical functions and window operations for advanced analytics
  • Scalability: Horizontal scaling capabilities that support growing analytical requirements
  • Flexibility: Adaptable pipeline patterns that evolve with changing business requirements
  • Integration: Seamless integration with MongoDB's document model and distributed architecture

Whether you're building real-time dashboards, customer segmentation systems, business intelligence platforms, or complex analytical applications, MongoDB's Aggregation Framework with QueryLeaf's familiar SQL interface provides the foundation for high-performance data processing at scale.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB aggregation pipelines while providing SQL-familiar syntax for complex analytics, window functions, and statistical operations. Advanced aggregation patterns, performance optimization, and real-time analytics capabilities are seamlessly accessible through familiar SQL constructs, making sophisticated data processing both powerful and approachable for SQL-oriented development teams.

The combination of MongoDB's flexible aggregation capabilities with SQL-style operations makes it an ideal platform for modern analytical applications that require both high performance and rapid development cycles, ensuring your data processing workflows can scale efficiently while remaining maintainable and adaptable to evolving business needs.

MongoDB Aggregation Pipeline Performance Optimization: Advanced Techniques for High-Performance Data Processing and Analytics

Modern applications increasingly rely on complex data analytics, real-time reporting, and sophisticated data transformations that demand high-performance aggregation capabilities. Poor aggregation pipeline design can lead to slow response times, excessive memory usage, and resource bottlenecks that become critical performance issues as data volumes and analytical complexity grow.

MongoDB's aggregation framework provides powerful capabilities for data processing, analysis, and transformation that can handle complex analytical workloads efficiently when properly optimized. Unlike limited relational database aggregation approaches, MongoDB pipelines support flexible document processing, nested data analysis, and sophisticated transformations that align with modern application requirements while maintaining performance at scale.

The Traditional Database Aggregation Limitations

Conventional relational database aggregation approaches impose significant constraints for modern analytical workloads:

-- Traditional PostgreSQL aggregation - rigid structure with performance limitations

-- Basic aggregation with limited optimization potential
WITH customer_metrics AS (
  SELECT 
    u.user_id,
    u.country,
    u.registration_date,
    u.status,
    COUNT(o.order_id) as order_count,
    SUM(o.total_amount) as total_spent,
    AVG(o.total_amount) as avg_order_value,
    MAX(o.created_at) as last_order_date,

    -- Limited JSON aggregation capabilities
    COUNT(CASE WHEN o.status = 'completed' THEN 1 END) as completed_orders,
    COUNT(CASE WHEN o.status = 'pending' THEN 1 END) as pending_orders,
    COUNT(CASE WHEN o.status = 'cancelled' THEN 1 END) as cancelled_orders,

    -- Basic window functions
    ROW_NUMBER() OVER (PARTITION BY u.country ORDER BY SUM(o.total_amount) DESC) as country_rank,
    PERCENT_RANK() OVER (ORDER BY SUM(o.total_amount)) as spending_percentile

  FROM users u
  LEFT JOIN orders o ON u.user_id = o.user_id
  WHERE u.registration_date >= CURRENT_DATE - INTERVAL '2 years'
    AND u.status = 'active'
  GROUP BY u.user_id, u.country, u.registration_date, u.status
),

product_analysis AS (
  SELECT 
    p.product_id,
    p.category,
    p.brand,
    p.price,
    COUNT(oi.order_item_id) as times_ordered,
    SUM(oi.quantity) as total_quantity_sold,
    SUM(oi.quantity * oi.unit_price) as total_revenue,

    -- Limited array and JSON processing
    AVG(CAST(r.rating AS NUMERIC)) as avg_rating,
    COUNT(r.review_id) as review_count,

    -- Complex subquery for related data
    (SELECT STRING_AGG(DISTINCT c.name, ', ') 
     FROM categories c 
     JOIN product_categories pc ON c.category_id = pc.category_id 
     WHERE pc.product_id = p.product_id
    ) as category_names,

    -- Percentile calculations require window functions
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY oi.unit_price) as price_q1,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY oi.unit_price) as price_median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY oi.unit_price) as price_q3

  FROM products p
  LEFT JOIN order_items oi ON p.product_id = oi.product_id
  LEFT JOIN orders o ON oi.order_id = o.order_id
  LEFT JOIN reviews r ON p.product_id = r.product_id
  WHERE o.status = 'completed'
    AND o.created_at >= CURRENT_DATE - INTERVAL '1 year'
  GROUP BY p.product_id, p.category, p.brand, p.price
),

sales_trends AS (
  SELECT 
    DATE_TRUNC('month', o.created_at) as month,
    u.country,
    p.category,
    COUNT(o.order_id) as orders,
    SUM(o.total_amount) as revenue,
    COUNT(DISTINCT u.user_id) as unique_customers,
    AVG(o.total_amount) as avg_order_value,

    -- Complex trend calculations
    LAG(SUM(o.total_amount)) OVER (
      PARTITION BY u.country, p.category 
      ORDER BY DATE_TRUNC('month', o.created_at)
    ) as prev_month_revenue,

    -- Percentage change calculation
    CASE 
      WHEN LAG(SUM(o.total_amount)) OVER (
        PARTITION BY u.country, p.category 
        ORDER BY DATE_TRUNC('month', o.created_at)
      ) > 0 THEN
        ROUND(
          (SUM(o.total_amount) - LAG(SUM(o.total_amount)) OVER (
            PARTITION BY u.country, p.category 
            ORDER BY DATE_TRUNC('month', o.created_at)
          )) / LAG(SUM(o.total_amount)) OVER (
            PARTITION BY u.country, p.category 
            ORDER BY DATE_TRUNC('month', o.created_at)
          ) * 100, 2
        )
      ELSE NULL
    END as revenue_growth_pct

  FROM orders o
  JOIN users u ON o.user_id = u.user_id
  JOIN order_items oi ON o.order_id = oi.order_id
  JOIN products p ON oi.product_id = p.product_id
  WHERE o.status = 'completed'
    AND o.created_at >= CURRENT_DATE - INTERVAL '18 months'
  GROUP BY DATE_TRUNC('month', o.created_at), u.country, p.category
)

-- Final complex analytical query with multiple CTEs
SELECT 
  cm.country,
  COUNT(DISTINCT cm.user_id) as total_customers,
  SUM(cm.total_spent) as country_revenue,
  AVG(cm.avg_order_value) as country_avg_order_value,

  -- Customer segmentation
  COUNT(CASE WHEN cm.total_spent > 1000 THEN 1 END) as high_value_customers,
  COUNT(CASE WHEN cm.total_spent BETWEEN 100 AND 1000 THEN 1 END) as medium_value_customers,
  COUNT(CASE WHEN cm.total_spent < 100 THEN 1 END) as low_value_customers,

  -- Activity analysis
  COUNT(CASE WHEN cm.last_order_date >= CURRENT_DATE - INTERVAL '30 days' THEN 1 END) as recent_customers,
  COUNT(CASE WHEN cm.last_order_date < CURRENT_DATE - INTERVAL '90 days' THEN 1 END) as inactive_customers,

  -- Product performance correlation
  (SELECT AVG(pa.avg_rating) 
   FROM product_analysis pa 
   JOIN order_items oi ON pa.product_id = oi.product_id 
   JOIN orders o ON oi.order_id = o.order_id 
   JOIN users u ON o.user_id = u.user_id 
   WHERE u.country = cm.country) as avg_product_rating,

  -- Sales trend analysis
  (SELECT AVG(st.revenue_growth_pct) 
   FROM sales_trends st 
   WHERE st.country = cm.country 
     AND st.month >= CURRENT_DATE - INTERVAL '6 months') as avg_growth_rate,

  -- Market share calculation
  ROUND(
    SUM(cm.total_spent) / 
    (SELECT SUM(total_spent) FROM customer_metrics) * 100, 2
  ) as market_share_pct,

  -- Customer concentration (top 20% of customers by spending)
  COUNT(CASE WHEN cm.spending_percentile >= 0.8 THEN 1 END) as top_tier_customers,

  -- Ranking by country performance
  RANK() OVER (ORDER BY SUM(cm.total_spent) DESC) as country_rank,
  DENSE_RANK() OVER (ORDER BY AVG(cm.avg_order_value) DESC) as aov_rank

FROM customer_metrics cm
GROUP BY cm.country
HAVING COUNT(DISTINCT cm.user_id) >= 100  -- Filter countries with sufficient data
ORDER BY SUM(cm.total_spent) DESC, AVG(cm.avg_order_value) DESC;

-- PostgreSQL aggregation problems:
-- 1. Complex multi-table joins required for nested data analysis
-- 2. Limited support for dynamic grouping and flexible document structures
-- 3. Poor performance with large datasets requiring multiple table scans
-- 4. Inflexible aggregation stages that cannot be easily reordered or optimized
-- 5. Basic JSON aggregation capabilities with limited nested field support
-- 6. Complex window function syntax for trend analysis and rankings
-- 7. Inefficient handling of array fields and multi-value attributes
-- 8. Limited memory management options for large aggregation operations
-- 9. Rigid aggregation pipeline that cannot adapt to varying data patterns
-- 10. Poor integration with modern application data structures

-- Additional performance issues:
-- - Memory exhaustion with large GROUP BY operations
-- - Nested subquery performance degradation
-- - Complex JOIN operations across multiple large tables
-- - Limited parallel processing capabilities for aggregation stages
-- - Inefficient handling of sparse data and optional fields

-- MySQL approach (even more limited)
SELECT 
  u.country,
  COUNT(DISTINCT u.user_id) as customers,
  COUNT(o.order_id) as orders,
  SUM(o.total_amount) as revenue,
  AVG(o.total_amount) as avg_order_value,

  -- Basic JSON functions (limited capabilities)
  AVG(CAST(JSON_EXTRACT(u.profile, '$.age') AS SIGNED)) as avg_age,
  COUNT(CASE WHEN JSON_EXTRACT(u.preferences, '$.newsletter') = true THEN 1 END) as newsletter_subscribers

FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE u.status = 'active'
  AND o.created_at >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
GROUP BY u.country
HAVING COUNT(DISTINCT u.user_id) >= 50
ORDER BY SUM(o.total_amount) DESC;

-- MySQL limitations:
-- - Very basic JSON aggregation functions
-- - Limited window function support in older versions
-- - Poor performance with complex aggregations
-- - Basic GROUP BY optimization
-- - Limited support for nested data analysis
-- - Minimal analytical function capabilities
-- - Simple aggregation pipeline with rigid structure

MongoDB's aggregation pipeline provides comprehensive, optimized data processing:

// MongoDB Advanced Aggregation Pipeline - flexible, powerful, and performance-optimized
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce_analytics');

// Advanced MongoDB aggregation pipeline manager
class MongoAggregationOptimizer {
  constructor(db) {
    this.db = db;
    this.collections = {
      users: db.collection('users'),
      orders: db.collection('orders'),
      products: db.collection('products'),
      reviews: db.collection('reviews'),
      analytics: db.collection('analytics')
    };
    this.pipelineCache = new Map();
    this.performanceTargets = {
      maxExecutionTime: 5000, // 5 seconds for complex analytics
      maxMemoryUsage: 100, // 100MB memory limit
      maxStages: 20 // Maximum pipeline stages
    };
  }

  async buildComprehensiveAnalyticsPipeline() {
    console.log('Building comprehensive analytics aggregation pipeline...');

    // Advanced customer analytics with optimized pipeline
    const customerAnalyticsPipeline = [
      // Stage 1: Initial match to reduce dataset early
      {
        $match: {
          status: 'active',
          createdAt: { $gte: new Date(Date.now() - 2 * 365 * 24 * 60 * 60 * 1000) }, // Last 2 years
          totalSpent: { $exists: true }
        }
      },

      // Stage 2: Project only required fields to reduce memory usage
      {
        $project: {
          userId: '$_id',
          country: 1,
          status: 1,
          createdAt: 1,
          totalSpent: 1,
          loyaltyTier: 1,
          preferences: 1,
          // Create computed fields early in pipeline
          registrationYear: { $year: '$createdAt' },
          registrationMonth: { $month: '$createdAt' },
          customerAge: {
            $divide: [
              { $subtract: [new Date(), '$createdAt'] },
              365 * 24 * 60 * 60 * 1000 // Convert to years
            ]
          }
        }
      },

      // Stage 3: Lookup orders with targeted fields only
      {
        $lookup: {
          from: 'orders',
          localField: 'userId',
          foreignField: 'userId',
          as: 'orders',
          pipeline: [
            {
              $match: {
                status: { $in: ['completed', 'pending', 'cancelled'] },
                createdAt: { $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) } // Last year
              }
            },
            {
              $project: {
                orderId: '$_id',
                status: 1,
                totalAmount: 1,
                createdAt: 1,
                items: {
                  $map: {
                    input: '$items',
                    as: 'item',
                    in: {
                      productId: '$$item.productId',
                      quantity: '$$item.quantity',
                      unitPrice: '$$item.unitPrice',
                      category: '$$item.category'
                    }
                  }
                }
              }
            }
          ]
        }
      },

      // Stage 4: Add computed fields for customer analysis
      {
        $addFields: {
          // Order statistics
          orderCount: { $size: '$orders' },
          completedOrders: {
            $size: {
              $filter: {
                input: '$orders',
                cond: { $eq: ['$$this.status', 'completed'] }
              }
            }
          },
          pendingOrders: {
            $size: {
              $filter: {
                input: '$orders',
                cond: { $eq: ['$$this.status', 'pending'] }
              }
            }
          },
          cancelledOrders: {
            $size: {
              $filter: {
                input: '$orders',
                cond: { $eq: ['$$this.status', 'cancelled'] }
              }
            }
          },

          // Revenue calculations
          totalRevenue: {
            $sum: {
              $map: {
                input: {
                  $filter: {
                    input: '$orders',
                    cond: { $eq: ['$$this.status', 'completed'] }
                  }
                },
                as: 'order',
                in: '$$order.totalAmount'
              }
            }
          },

          // Customer behavior analysis
          avgOrderValue: {
            $cond: {
              if: { $gt: [{ $size: '$orders' }, 0] },
              then: {
                $avg: {
                  $map: {
                    input: {
                      $filter: {
                        input: '$orders',
                        cond: { $eq: ['$$this.status', 'completed'] }
                      }
                    },
                    as: 'order',
                    in: '$$order.totalAmount'
                  }
                }
              },
              else: 0
            }
          },

          // Recency analysis
          lastOrderDate: {
            $max: {
              $map: {
                input: '$orders',
                as: 'order',
                in: '$$order.createdAt'
              }
            }
          },

          // Product diversity analysis
          uniqueCategories: {
            $size: {
              $setUnion: {
                $reduce: {
                  input: '$orders',
                  initialValue: [],
                  in: {
                    $setUnion: [
                      '$$value',
                      {
                        $map: {
                          input: '$$this.items',
                          as: 'item',
                          in: '$$item.category'
                        }
                      }
                    ]
                  }
                }
              }
            }
          }
        }
      },

      // Stage 5: Customer segmentation
      {
        $addFields: {
          // Value segmentation
          valueSegment: {
            $switch: {
              branches: [
                {
                  case: { $gte: ['$totalRevenue', 1000] },
                  then: 'high_value'
                },
                {
                  case: { $gte: ['$totalRevenue', 100] },
                  then: 'medium_value'
                }
              ],
              default: 'low_value'
            }
          },

          // Activity segmentation
          activitySegment: {
            $switch: {
              branches: [
                {
                  case: {
                    $gte: [
                      '$lastOrderDate',
                      new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'active'
                },
                {
                  case: {
                    $gte: [
                      '$lastOrderDate',
                      new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'recent'
                },
                {
                  case: {
                    $gte: [
                      '$lastOrderDate',
                      new Date(Date.now() - 180 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'inactive'
                }
              ],
              default: 'dormant'
            }
          },

          // Engagement scoring
          engagementScore: {
            $add: [
              // Order frequency component (0-40 points)
              { $multiply: [{ $min: ['$orderCount', 10] }, 4] },

              // Revenue component (0-30 points)
              { $multiply: [{ $min: [{ $divide: ['$totalRevenue', 100] }, 10] }, 3] },

              // Category diversity component (0-20 points)
              { $multiply: [{ $min: ['$uniqueCategories', 10] }, 2] },

              // Recency component (0-10 points)
              {
                $cond: {
                  if: {
                    $gte: [
                      '$lastOrderDate',
                      new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 10,
                  else: {
                    $cond: {
                      if: {
                        $gte: [
                          '$lastOrderDate',
                          new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
                        ]
                      },
                      then: 5,
                      else: 0
                    }
                  }
                }
              }
            ]
          }
        }
      },

      // Stage 6: Group by country and segments for analysis
      {
        $group: {
          _id: {
            country: '$country',
            valueSegment: '$valueSegment',
            activitySegment: '$activitySegment'
          },

          // Customer counts
          customerCount: { $sum: 1 },

          // Revenue metrics
          totalRevenue: { $sum: '$totalRevenue' },
          avgRevenue: { $avg: '$totalRevenue' },
          maxRevenue: { $max: '$totalRevenue' },
          minRevenue: { $min: '$totalRevenue' },

          // Order metrics
          totalOrders: { $sum: '$orderCount' },
          avgOrdersPerCustomer: { $avg: '$orderCount' },
          totalCompletedOrders: { $sum: '$completedOrders' },

          // Behavioral metrics
          avgOrderValue: { $avg: '$avgOrderValue' },
          avgEngagementScore: { $avg: '$engagementScore' },
          avgCategoryDiversity: { $avg: '$uniqueCategories' },

          // Customer lifecycle metrics
          avgCustomerAge: { $avg: '$customerAge' },

          // Statistical measures
          revenueStdDev: { $stdDevPop: '$totalRevenue' },
          engagementStdDev: { $stdDevPop: '$engagementScore' },

          // Collect per-customer values so percentiles can be derived with $sortArray in the next stage
          customers: {
            $push: {
              userId: '$userId',
              totalRevenue: '$totalRevenue',
              engagementScore: '$engagementScore',
              orderCount: '$orderCount'
            }
          }
        }
      },

      // Stage 7: Calculate percentiles and advanced metrics
      {
        $addFields: {
          // Revenue percentiles
          revenuePercentiles: {
            $let: {
              vars: {
                sortedRevenues: {
                  $map: {
                    input: {
                      $sortArray: {
                        input: '$customers.totalRevenue',
                        sortBy: 1
                      }
                    },
                    as: 'rev',
                    in: '$$rev'
                  }
                }
              },
              in: {
                p25: {
                  $arrayElemAt: [
                    '$$sortedRevenues',
                    { $floor: { $multiply: [{ $size: '$$sortedRevenues' }, 0.25] } }
                  ]
                },
                p50: {
                  $arrayElemAt: [
                    '$$sortedRevenues',
                    { $floor: { $multiply: [{ $size: '$$sortedRevenues' }, 0.5] } }
                  ]
                },
                p75: {
                  $arrayElemAt: [
                    '$$sortedRevenues',
                    { $floor: { $multiply: [{ $size: '$$sortedRevenues' }, 0.75] } }
                  ]
                },
                p90: {
                  $arrayElemAt: [
                    '$$sortedRevenues',
                    { $floor: { $multiply: [{ $size: '$$sortedRevenues' }, 0.9] } }
                  ]
                }
              }
            }
          },

          // Customer concentration metrics
          topCustomerRevenue: {
            $sum: {
              $slice: [
                {
                  $sortArray: {
                    input: '$customers.totalRevenue',
                    sortBy: -1
                  }
                },
                { $min: [{ $ceil: { $multiply: ['$customerCount', 0.2] } }, 10] }
              ]
            }
          }
        }
      },

      // Stage 8: Add market analysis
      {
        $addFields: {
          // Customer concentration (top 20% revenue share)
          customerConcentration: {
            $divide: ['$topCustomerRevenue', '$totalRevenue']
          },

          // Segment performance indicators
          performanceIndicators: {
            revenuePerCustomer: { $divide: ['$totalRevenue', '$customerCount'] },
            ordersPerCustomer: { $divide: ['$totalOrders', '$customerCount'] },
            completionRate: {
              $cond: {
                if: { $gt: ['$totalOrders', 0] },
                then: { $divide: ['$totalCompletedOrders', '$totalOrders'] },
                else: 0
              }
            }
          },

          // Growth potential scoring
          growthPotential: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $eq: ['$_id.valueSegment', 'high_value'] },
                      { $eq: ['$_id.activitySegment', 'active'] }
                    ]
                  },
                  then: 'maintain'
                },
                {
                  case: {
                    $and: [
                      { $eq: ['$_id.valueSegment', 'high_value'] },
                      { $ne: ['$_id.activitySegment', 'active'] }
                    ]
                  },
                  then: 'reactivate'
                },
                {
                  case: {
                    $and: [
                      { $ne: ['$_id.valueSegment', 'low_value'] },
                      { $eq: ['$_id.activitySegment', 'active'] }
                    ]
                  },
                  then: 'upsell'
                },
                {
                  case: { $eq: ['$_id.activitySegment', 'dormant'] },
                  then: 'winback'
                }
              ],
              default: 'nurture'
            }
          }
        }
      },

      // Stage 9: Remove detailed customer data to reduce output size
      {
        $project: {
          customers: 0 // Remove large array to optimize output
        }
      },

      // Stage 10: Sort by strategic importance
      {
        $sort: {
          totalRevenue: -1,
          customerCount: -1,
          '_id.country': 1
        }
      },

      // Stage 11: Add final computed fields for presentation
      {
        $addFields: {
          segmentId: {
            $concat: [
              '$_id.country',
              '_',
              '$_id.valueSegment',
              '_',
              '$_id.activitySegment'
            ]
          },

          // Strategic priority scoring
          strategicPriority: {
            $add: [
              // Revenue weight (40%)
              { $multiply: [{ $divide: ['$totalRevenue', 10000] }, 0.4] },

              // Customer count weight (30%)
              { $multiply: [{ $divide: ['$customerCount', 100] }, 0.3] },

              // Engagement weight (20%)
              { $multiply: [{ $divide: ['$avgEngagementScore', 100] }, 0.2] },

              // Growth potential weight (10%)
              {
                $switch: {
                  branches: [
                    { case: { $eq: ['$growthPotential', 'upsell'] }, then: 0.1 },
                    { case: { $eq: ['$growthPotential', 'reactivate'] }, then: 0.08 },
                    { case: { $eq: ['$growthPotential', 'maintain'] }, then: 0.06 },
                    { case: { $eq: ['$growthPotential', 'nurture'] }, then: 0.04 }
                  ],
                  default: 0.02
                }
              }
            ]
          }
        }
      }
    ];

    console.log('Executing comprehensive customer analytics pipeline...');
    const startTime = Date.now();

    try {
      const results = await this.collections.users.aggregate(
        customerAnalyticsPipeline,
        {
          allowDiskUse: true, // Enable disk usage for large datasets
          maxTimeMS: this.performanceTargets.maxExecutionTime,
          hint: { status: 1, createdAt: 1, totalSpent: 1 }, // Force use of a matching compound index (must already exist)
          cursor: { batchSize: 1000 } // Optimize cursor batch size
        }
      ).toArray();

      const executionTime = Date.now() - startTime;

      console.log(`Pipeline executed successfully in ${executionTime}ms`);
      console.log(`Processed ${results.length} customer segments`);

      // Cache results for performance optimization
      this.pipelineCache.set('customer_analytics', {
        results,
        timestamp: new Date(),
        executionTime
      });

      return {
        results,
        executionStats: {
          executionTime,
          segmentsAnalyzed: results.length,
          performanceGrade: this.calculatePerformanceGrade(executionTime)
        }
      };

    } catch (error) {
      console.error('Pipeline execution failed:', error);
      throw error;
    }
  }

  async buildProductPerformanceAnalytics() {
    console.log('Building product performance analytics pipeline...');

    const productAnalyticsPipeline = [
      // Stage 1: Match active products added within the last year
      {
        $match: {
          status: 'active',
          createdAt: { $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) }
        }
      },

      // Stage 2: Lookup orders and reviews with sub-pipeline optimization
      {
        $lookup: {
          from: 'orders',
          let: { productId: '$_id' },
          pipeline: [
            {
              $match: {
                $expr: {
                  $and: [
                    { $in: ['$$productId', '$items.productId'] },
                    { $eq: ['$status', 'completed'] },
                    { $gte: ['$createdAt', new Date(Date.now() - 365 * 24 * 60 * 60 * 1000)] }
                  ]
                }
              }
            },
            {
              $unwind: '$items'
            },
            {
              $match: {
                $expr: { $eq: ['$items.productId', '$$productId'] }
              }
            },
            {
              $project: {
                orderId: '$_id',
                userId: 1,
                createdAt: 1,
                quantity: '$items.quantity',
                unitPrice: '$items.unitPrice',
                revenue: { $multiply: ['$items.quantity', '$items.unitPrice'] }
              }
            }
          ],
          as: 'sales'
        }
      },

      // Stage 3: Lookup reviews with aggregation
      {
        $lookup: {
          from: 'reviews',
          localField: '_id',
          foreignField: 'productId',
          pipeline: [
            {
              $match: {
                status: 'published',
                rating: { $gte: 1, $lte: 5 }
              }
            },
            {
              $group: {
                _id: null,
                avgRating: { $avg: '$rating' },
                reviewCount: { $sum: 1 },
                ratingDistribution: {
                  $push: {
                    rating: '$rating',
                    helpful: '$helpfulVotes',
                    sentiment: '$sentiment'
                  }
                }
              }
            }
          ],
          as: 'reviewMetrics'
        }
      },

      // Stage 4: Calculate comprehensive product metrics
      {
        $addFields: {
          // Sales performance
          totalSales: { $size: '$sales' },
          totalRevenue: { $sum: '$sales.revenue' },
          totalQuantitySold: { $sum: '$sales.quantity' },
          avgOrderQuantity: { $avg: '$sales.quantity' },
          avgUnitPrice: { $avg: '$sales.unitPrice' },

          // Customer metrics
          uniqueCustomers: {
            $size: {
              $setUnion: {
                $map: {
                  input: '$sales',
                  as: 'sale',
                  in: '$$sale.userId'
                }
              }
            }
          },

          // Temporal analysis
          salesByMonth: {
            $reduce: {
              input: {
                $map: {
                  input: '$sales',
                  as: 'sale',
                  in: {
                    month: { $dateToString: { format: '%Y-%m', date: '$$sale.createdAt' } },
                    revenue: '$$sale.revenue',
                    quantity: '$$sale.quantity'
                  }
                }
              },
              initialValue: {},
              in: {
                $mergeObjects: [
                  '$$value',
                  {
                    $arrayToObject: [
                      [{
                        k: '$$this.month',
                        v: {
                          revenue: { $add: [{ $ifNull: [{ $getField: { field: 'revenue', input: { $getField: { field: '$$this.month', input: '$$value' } } } }, 0] }, '$$this.revenue'] },
                          quantity: { $add: [{ $ifNull: [{ $getField: { field: 'quantity', input: { $getField: { field: '$$this.month', input: '$$value' } } } }, 0] }, '$$this.quantity'] },
                          orders: { $add: [{ $ifNull: [{ $getField: { field: 'orders', input: { $getField: { field: '$$this.month', input: '$$value' } } } }, 0] }, 1] }
                        }
                      }]
                    ]
                  }
                ]
              }
            }
          },

          // Review metrics
          avgRating: { $arrayElemAt: ['$reviewMetrics.avgRating', 0] },
          reviewCount: { $arrayElemAt: ['$reviewMetrics.reviewCount', 0] },

          // Performance indicators
          salesVelocity: {
            $cond: {
              if: { $gt: [{ $size: '$sales' }, 0] },
              then: {
                $divide: [
                  { $size: '$sales' },
                  {
                    $divide: [
                      {
                        $subtract: [
                          new Date(),
                          { $min: '$sales.createdAt' }
                        ]
                      },
                      30 * 24 * 60 * 60 * 1000 // 30-day periods
                    ]
                  }
                ]
              },
              else: 0
            }
          }
        }
      },

      // Stage 5: Product classification and scoring
      {
        $addFields: {
          // Performance classification
          performanceClass: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gte: ['$totalRevenue', 10000] },
                      { $gte: ['$uniqueCustomers', 100] },
                      { $gte: [{ $ifNull: ['$avgRating', 0] }, 4.0] }
                    ]
                  },
                  then: 'star'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$totalRevenue', 5000] },
                      { $gte: ['$uniqueCustomers', 50] },
                      { $gte: [{ $ifNull: ['$avgRating', 0] }, 3.5] }
                    ]
                  },
                  then: 'strong'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$totalRevenue', 1000] },
                      { $gte: ['$uniqueCustomers', 20] }
                    ]
                  },
                  then: 'growing'
                },
                {
                  case: { $lte: ['$totalSales', 5] },
                  then: 'new'
                }
              ],
              default: 'underperforming'
            }
          },

          // Profitability scoring (simplified model)
          profitabilityScore: {
            $add: [ // weighted sum of the factors below
              // Revenue factor (40%)
              { $multiply: [{ $divide: ['$totalRevenue', 1000] }, 0.4] },

              // Customer satisfaction factor (30%)
              { $multiply: [{ $divide: [{ $ifNull: ['$avgRating', 3] }, 5] }, 0.3] },

              // Market penetration factor (20%)
              { $multiply: [{ $divide: ['$uniqueCustomers', 100] }, 0.2] },

              // Sales velocity factor (10%)
              { $multiply: [{ $min: ['$salesVelocity', 10] }, 0.01] }
            ]
          },

          // Inventory turnover estimation
          inventoryTurnover: {
            $cond: {
              if: { $and: [{ $gt: ['$stock', 0] }, { $gt: ['$totalQuantitySold', 0] }] },
              then: { $divide: ['$totalQuantitySold', '$stock'] },
              else: 0
            }
          }
        }
      },

      // Stage 6: Group by category for market analysis
      {
        $group: {
          _id: {
            category: '$category',
            brand: '$brand',
            performanceClass: '$performanceClass'
          },

          productCount: { $sum: 1 },

          // Revenue aggregations
          totalCategoryRevenue: { $sum: '$totalRevenue' },
          avgProductRevenue: { $avg: '$totalRevenue' },
          maxProductRevenue: { $max: '$totalRevenue' },

          // Customer aggregations
          totalUniqueCustomers: { $sum: '$uniqueCustomers' },
          avgCustomersPerProduct: { $avg: '$uniqueCustomers' },

          // Rating aggregations
          avgCategoryRating: { $avg: { $ifNull: ['$avgRating', 0] } },
          avgReviewCount: { $avg: { $ifNull: ['$reviewCount', 0] } },

          // Performance aggregations
          avgProfitabilityScore: { $avg: '$profitabilityScore' },
          avgSalesVelocity: { $avg: '$salesVelocity' },
          avgInventoryTurnover: { $avg: '$inventoryTurnover' },

          // Product examples for reference
          topProducts: {
            $push: {
              $cond: {
                if: { $gte: ['$profitabilityScore', 5] },
                then: {
                  productId: '$_id',
                  name: '$name',
                  revenue: '$totalRevenue',
                  rating: '$avgRating',
                  profitabilityScore: '$profitabilityScore'
                },
                else: '$$REMOVE'
              }
            }
          }
        }
      },

      // Stage 7: Keep only the top products per segment (market share is computed in Stage 9)
      {
        $addFields: {
          topProducts: { $slice: [{ $sortArray: { input: '$topProducts', sortBy: { profitabilityScore: -1 } } }, 3] }
        }
      },

      // Stage 8: Add competitive analysis
      {
        $lookup: {
          from: 'products',
          let: { currentCategory: '$_id.category' },
          pipeline: [
            {
              $match: {
                $expr: { $eq: ['$category', '$$currentCategory'] },
                status: 'active'
              }
            },
            {
              $group: {
                _id: null,
                totalCategoryProducts: { $sum: 1 },
                avgCategoryPrice: { $avg: '$price' },
                categoryPriceRange: {
                  min: { $min: '$price' },
                  max: { $max: '$price' }
                }
              }
            }
          ],
          as: 'categoryContext'
        }
      },

      // Stage 9: Final metrics and insights
      {
        $addFields: {
          // Market share within category
          categoryMarketShare: {
            $divide: [
              '$productCount',
              { $arrayElemAt: ['$categoryContext.totalCategoryProducts', 0] }
            ]
          },

          // Performance vs category average
          performanceVsCategory: {
            $divide: [
              '$avgProductRevenue',
              { $arrayElemAt: ['$categoryContext.avgCategoryPrice', 0] }
            ]
          }
        }
      },

      // Stage 9b: Strategic recommendations (separate stage so categoryMarketShare from Stage 9 is visible)
      {
        $addFields: {
          strategicRecommendation: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $eq: ['$_id.performanceClass', 'star'] },
                      { $gte: ['$avgInventoryTurnover', 4] }
                    ]
                  },
                  then: 'expand_and_invest'
                },
                {
                  case: {
                    $and: [
                      { $eq: ['$_id.performanceClass', 'strong'] },
                      { $gte: ['$categoryMarketShare', 0.1] }
                    ]
                  },
                  then: 'market_leader_strategy'
                },
                {
                  case: { $eq: ['$_id.performanceClass', 'growing'] },
                  then: 'nurture_and_optimize'
                },
                {
                  case: { $eq: ['$_id.performanceClass', 'underperforming'] },
                  then: 'review_and_improve'
                }
              ],
              default: 'monitor'
            }
          }
        }
      },

      // Stage 10: Clean up and sort
      {
        $project: {
          categoryContext: 0 // Remove lookup data to reduce output size
        }
      },

      {
        $sort: {
          totalCategoryRevenue: -1,
          avgProfitabilityScore: -1,
          '_id.category': 1
        }
      }
    ];

    console.log('Executing product performance analytics pipeline...');
    const startTime = Date.now();

    try {
      const results = await this.collections.products.aggregate(
        productAnalyticsPipeline,
        {
          allowDiskUse: true,
          maxTimeMS: this.performanceTargets.maxExecutionTime,
          cursor: { batchSize: 500 }
        }
      ).toArray();

      const executionTime = Date.now() - startTime;

      console.log(`Product analytics pipeline executed in ${executionTime}ms`);
      console.log(`Analyzed ${results.length} product categories`);

      return {
        results,
        executionStats: {
          executionTime,
          categoriesAnalyzed: results.length,
          performanceGrade: this.calculatePerformanceGrade(executionTime)
        }
      };

    } catch (error) {
      console.error('Product analytics pipeline failed:', error);
      throw error;
    }
  }

  async buildTimeSeriesAnalytics() {
    console.log('Building time-series analytics pipeline...');

    const timeSeriesPipeline = [
      // Stage 1: Match recent orders for trend analysis
      {
        $match: {
          status: 'completed',
          createdAt: {
            $gte: new Date(Date.now() - 18 * 30 * 24 * 60 * 60 * 1000) // 18 months
          }
        }
      },

      // Stage 2: Create time buckets and extract relevant fields
      {
        $addFields: {
          // Multiple time granularities
          yearMonth: { $dateToString: { format: '%Y-%m', date: '$createdAt' } },
          year: { $year: '$createdAt' },
          month: { $month: '$createdAt' },
          quarter: { $ceil: { $divide: [{ $month: '$createdAt' }, 3] } },
          weekOfYear: { $week: '$createdAt' },
          dayOfWeek: { $dayOfWeek: '$createdAt' },
          hourOfDay: { $hour: '$createdAt' },

          // Business metrics
          itemCount: { $size: '$items' },
          avgItemPrice: { $avg: '$items.unitPrice' }
        }
      },

      // Stage 3: Unwind items for product-level analysis
      {
        $unwind: '$items'
      },

      // Stage 4: Group by time periods with comprehensive metrics
      {
        $group: {
          _id: {
            yearMonth: '$yearMonth',
            year: '$year',
            month: '$month',
            quarter: '$quarter',
            category: '$items.category',
            userCountry: '$userCountry'
          },

          // Volume metrics
          orderCount: { $sum: 1 },
          totalRevenue: { $sum: { $multiply: ['$items.quantity', '$items.unitPrice'] } },
          totalQuantity: { $sum: '$items.quantity' },
          uniqueCustomers: { $addToSet: '$userId' },
          uniqueProducts: { $addToSet: '$items.productId' },

          // Average metrics
          avgOrderValue: { $avg: '$totalAmount' },
          avgQuantityPerOrder: { $avg: '$items.quantity' },
          avgUnitPrice: { $avg: '$items.unitPrice' },

          // Distribution metrics
          orderSizes: { $push: '$totalAmount' },
          customerFrequency: { $push: '$userId' },

          // Time-based patterns
          hourDistribution: {
            $push: {
              hour: '$hourOfDay',
              dayOfWeek: '$dayOfWeek',
              amount: '$totalAmount'
            }
          },

          // Product performance
          productMix: {
            $push: {
              productId: '$items.productId',
              category: '$items.category',
              quantity: '$items.quantity',
              revenue: { $multiply: ['$items.quantity', '$items.unitPrice'] }
            }
          }
        }
      },

      // Stage 5: Calculate advanced time-series metrics
      {
        $addFields: {
          uniqueCustomerCount: { $size: '$uniqueCustomers' },
          uniqueProductCount: { $size: '$uniqueProducts' },

          // Customer behavior metrics
          repeatCustomerRate: {
            $divide: [
              {
                $size: {
                  $filter: {
                    // $filter requires an array, so expose the { userId: count } accumulator as { k, v } pairs
                    input: {
                      $objectToArray: {
                        $reduce: {
                          input: '$customerFrequency',
                          initialValue: {},
                          in: {
                            $mergeObjects: [
                              '$$value',
                              {
                                $arrayToObject: [
                                  [{
                                    k: { $toString: '$$this' },
                                    v: { $add: [{ $ifNull: [{ $getField: { field: { $toString: '$$this' }, input: '$$value' } }, 0] }, 1] }
                                  }]
                                ]
                              }
                            ]
                          }
                        }
                      }
                    },
                    cond: { $gt: ['$$this.v', 1] }
                  }
                }
              },
              { $size: '$uniqueCustomers' } // uniqueCustomerCount is added in this same stage, so recompute it here
            ]
          },

          // Revenue concentration (top 20% of orders)
          revenueConcentration: {
            $let: {
              vars: {
                sortedOrders: { $sortArray: { input: '$orderSizes', sortBy: -1 } },
                top20PercentCount: { $ceil: { $multiply: ['$orderCount', 0.2] } }
              },
              in: {
                $divide: [
                  { $sum: { $slice: ['$$sortedOrders', '$$top20PercentCount'] } },
                  '$totalRevenue'
                ]
              }
            }
          },

          // Peak hour analysis
          peakHours: {
            $let: {
              vars: {
                hourlyTotals: {
                  $reduce: {
                    input: '$hourDistribution',
                    initialValue: {},
                    in: {
                      $mergeObjects: [
                        '$$value',
                        {
                          $arrayToObject: [
                            [{
                              k: { $toString: '$$this.hour' },
                              v: {
                                orders: { $add: [{ $ifNull: [{ $getField: { field: 'orders', input: { $getField: { field: { $toString: '$$this.hour' }, input: '$$value' } } } }, 0] }, 1] },
                                revenue: { $add: [{ $ifNull: [{ $getField: { field: 'revenue', input: { $getField: { field: { $toString: '$$this.hour' }, input: '$$value' } } } }, 0] }, '$$this.amount'] }
                              }
                            }]
                          ]
                        }
                      ]
                    }
                  }
                }
              },
              in: {
                $arrayElemAt: [
                  {
                    $sortArray: {
                      input: {
                        $objectToArray: '$$hourlyTotals'
                      },
                      sortBy: { 'v.revenue': -1 }
                    }
                  },
                  0
                ]
              }
            }
          }
        }
      },

      // Stage 6: Sort for time-series analysis
      {
        $sort: {
          '_id.year': 1,
          '_id.month': 1,
          '_id.category': 1,
          '_id.userCountry': 1
        }
      },

      // Stage 7: Window functions for trend analysis
      {
        $setWindowFields: {
          partitionBy: { category: '$_id.category', country: '$_id.userCountry' },
          sortBy: { '_id.year': 1, '_id.month': 1 },
          output: {
            // Moving averages
            movingAvgRevenue: {
              $avg: '$totalRevenue',
              window: { documents: [-2, 0] } // 3-month moving average (current and 2 preceding buckets)
            },

            movingAvgOrders: {
              $avg: '$orderCount',
              window: { documents: [-2, 0] }
            },

            // Growth calculations
            prevMonthRevenue: {
              $shift: { output: '$totalRevenue', by: -1 }
            },

            prevYearRevenue: {
              $shift: { output: '$totalRevenue', by: -12 }
            },

            // Ranking
            revenueRank: {
              $denseRank: {}
            },

            // Cumulative metrics
            cumulativeRevenue: {
              $sum: '$totalRevenue',
              window: { documents: ['unbounded', 'current'] }
            }
          }
        }
      },

      // Stage 8: Calculate growth rates and trends
      {
        $addFields: {
          // Month-over-month growth
          momGrowthRate: {
            $cond: {
              if: { $and: [{ $ne: ['$prevMonthRevenue', null] }, { $gt: ['$prevMonthRevenue', 0] }] },
              then: {
                $multiply: [
                  {
                    $divide: [
                      { $subtract: ['$totalRevenue', '$prevMonthRevenue'] },
                      '$prevMonthRevenue'
                    ]
                  },
                  100
                ]
              },
              else: null
            }
          },

          // Year-over-year growth
          yoyGrowthRate: {
            $cond: {
              if: { $and: [{ $ne: ['$prevYearRevenue', null] }, { $gt: ['$prevYearRevenue', 0] }] },
              then: {
                $multiply: [
                  {
                    $divide: [
                      { $subtract: ['$totalRevenue', '$prevYearRevenue'] },
                      '$prevYearRevenue'
                    ]
                  },
                  100
                ]
              },
              else: null
            }
          }
        }
      },

      // Stage 8b: Classify trends (separate stage so the growth rates computed above are visible)
      {
        $addFields: {
          trendClassification: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gte: [{ $ifNull: ['$momGrowthRate', 0] }, 10] },
                      { $gte: ['$totalRevenue', '$movingAvgRevenue'] }
                    ]
                  },
                  then: 'strong_growth'
                },
                {
                  case: {
                    $and: [
                      { $gte: [{ $ifNull: ['$momGrowthRate', 0] }, 0] },
                      { $lte: [{ $ifNull: ['$momGrowthRate', 0] }, 10] }
                    ]
                  },
                  then: 'steady_growth'
                },
                {
                  case: { $lt: [{ $ifNull: ['$momGrowthRate', 0] }, -10] },
                  then: 'declining'
                }
              ],
              default: 'stable'
            }
          },

          // Seasonality indicators
          seasonalityScore: {
            $cond: {
              if: { $in: ['$_id.month', [11, 12, 1]] }, // Holiday season
              then: 1.2,
              else: {
                $cond: {
                  if: { $in: ['$_id.month', [6, 7, 8]] }, // Summer
                  then: 0.9,
                  else: 1.0
                }
              }
            }
          }
        }
      },

      // Stage 9: Final grouping for summary insights
      {
        $group: {
          _id: {
            category: '$_id.category',
            country: '$_id.userCountry'
          },

          // Time series data points
          monthlyData: {
            $push: {
              yearMonth: '$_id.yearMonth',
              revenue: '$totalRevenue',
              orders: '$orderCount',
              customers: '$uniqueCustomerCount',
              avgOrderValue: '$avgOrderValue',
              momGrowth: '$momGrowthRate',
              yoyGrowth: '$yoyGrowthRate',
              trend: '$trendClassification'
            }
          },

          // Summary statistics
          totalPeriodRevenue: { $sum: '$totalRevenue' },
          totalPeriodOrders: { $sum: '$orderCount' },
          avgMonthlyRevenue: { $avg: '$totalRevenue' },

          // Growth metrics
          avgMomGrowth: { $avg: { $ifNull: ['$momGrowthRate', 0] } },
          avgYoyGrowth: { $avg: { $ifNull: ['$yoyGrowthRate', 0] } },

          // Volatility measures
          revenueVolatility: { $stdDevPop: '$totalRevenue' },
          orderVolatility: { $stdDevPop: '$orderCount' },

          // Trend analysis
          trendDistribution: {
            $push: '$trendClassification'
          },

          // Peak performance
          peakMonthRevenue: { $max: '$totalRevenue' },
          peakMonthOrders: { $max: '$orderCount' }
        }
      },

      // Stage 10: Final insights and recommendations
      {
        $addFields: {
          // Dominant trend
          dominantTrend: {
            $let: {
              vars: {
                trendCounts: {
                  $reduce: {
                    input: '$trendDistribution',
                    initialValue: {},
                    in: {
                      $mergeObjects: [
                        '$$value',
                        {
                          $arrayToObject: [
                            [{
                              k: '$$this',
                              v: { $add: [{ $ifNull: [{ $getField: { field: '$$this', input: '$$value' } }, 0] }, 1] }
                            }]
                          ]
                        }
                      ]
                    }
                  }
                }
              },
              in: {
                $arrayElemAt: [
                  {
                    $sortArray: {
                      input: { $objectToArray: '$$trendCounts' },
                      sortBy: { v: -1 }
                    }
                  },
                  0
                ]
              }
            }
          },

          // Performance classification
          performanceClassification: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gte: ['$avgYoyGrowth', 20] },
                      { $lte: ['$revenueVolatility', '$avgMonthlyRevenue'] }
                    ]
                  },
                  then: 'high_growth_stable'
                },
                {
                  case: { $gte: ['$avgYoyGrowth', 20] },
                  then: 'high_growth_volatile'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$avgYoyGrowth', 5] },
                      { $lte: ['$revenueVolatility', '$avgMonthlyRevenue'] }
                    ]
                  },
                  then: 'steady_growth'
                },
                {
                  case: { $lt: ['$avgYoyGrowth', -5] },
                  then: 'declining'
                }
              ],
              default: 'mature_stable'
            }
          }
        }
      },

      {
        $sort: {
          totalPeriodRevenue: -1,
          avgYoyGrowth: -1
        }
      }
    ];

    console.log('Executing time-series analytics pipeline...');
    const startTime = Date.now();

    try {
      const results = await this.collections.orders.aggregate(
        timeSeriesPipeline,
        {
          allowDiskUse: true,
          maxTimeMS: this.performanceTargets.maxExecutionTime,
          cursor: { batchSize: 100 }
        }
      ).toArray();

      const executionTime = Date.now() - startTime;

      console.log(`Time-series analytics executed in ${executionTime}ms`);
      console.log(`Analyzed ${results.length} category-country combinations`);

      return {
        results,
        executionStats: {
          executionTime,
          timeSeriesAnalyzed: results.length,
          performanceGrade: this.calculatePerformanceGrade(executionTime)
        }
      };

    } catch (error) {
      console.error('Time-series analytics failed:', error);
      throw error;
    }
  }

  calculatePerformanceGrade(executionTimeMs) {
    // Performance grading based on execution time
    if (executionTimeMs <= 1000) return 'A';
    if (executionTimeMs <= 2500) return 'B';
    if (executionTimeMs <= 5000) return 'C';
    if (executionTimeMs <= 10000) return 'D';
    return 'F';
  }

  async optimizePipelinePerformance(pipeline, options = {}) {
    console.log('Optimizing aggregation pipeline performance...');

    const {
      enableIndexHints = true,
      enableDiskUsage = true,
      optimizeBatchSize = true,
      enablePipelineReordering = true
    } = options;

    // Performance optimization strategies
    const optimizedPipeline = [...pipeline];

    if (enablePipelineReordering) {
      // Move $match stages to the beginning
      const matchStages = [];
      const otherStages = [];

      for (const stage of optimizedPipeline) {
        if (stage.$match) {
          matchStages.push(stage);
        } else {
          otherStages.push(stage);
        }
      }

      // Reorder: matches first, then other stages
      optimizedPipeline.length = 0;
      optimizedPipeline.push(...matchStages, ...otherStages);
    }

    // Add $project stages early to reduce data size
    const hasEarlyProject = optimizedPipeline.slice(0, 3).some(stage => stage.$project);
    if (!hasEarlyProject && optimizedPipeline.length > 5) {
      // Insert projection after initial match stages
      const insertIndex = optimizedPipeline.findIndex(stage => !stage.$match) || 1;
      optimizedPipeline.splice(insertIndex, 0, {
        $project: {
          // Project only commonly used fields
          _id: 1,
          status: 1,
          createdAt: 1,
          totalAmount: 1,
          userId: 1,
          items: 1
        }
      });
    }

    // Aggregation options
    const aggregationOptions = {
      allowDiskUse: enableDiskUsage,
      maxTimeMS: this.performanceTargets.maxExecutionTime
    };

    if (optimizeBatchSize) {
      aggregationOptions.cursor = { batchSize: 1000 };
    }

    if (enableIndexHints) {
      // Suggest optimal index based on initial match conditions
      const firstMatch = optimizedPipeline.find(stage => stage.$match);
      if (firstMatch) {
        const matchFields = Object.keys(firstMatch.$match);
        aggregationOptions.hint = this.suggestOptimalIndex(matchFields);
      }
    }

    return {
      optimizedPipeline,
      aggregationOptions,
      optimizations: {
        reorderedStages: enablePipelineReordering,
        addedEarlyProjection: !hasEarlyProject && optimizedPipeline.length > 5,
        indexHint: aggregationOptions.hint || null,
        diskUsageEnabled: enableDiskUsage
      }
    };
  }

  suggestOptimalIndex(matchFields) {
    // Simple heuristic for index suggestion
    const indexSuggestions = {
      status: { status: 1 },
      createdAt: { createdAt: -1 },
      userId: { userId: 1 },
      totalAmount: { totalAmount: -1 }
    };

    // Return compound index if multiple fields
    if (matchFields.length > 1) {
      const compoundIndex = {};
      for (const field of matchFields) {
        if (field === 'createdAt' || field === 'totalAmount') {
          compoundIndex[field] = -1;
        } else {
          compoundIndex[field] = 1;
        }
      }
      return compoundIndex;
    }

    return indexSuggestions[matchFields[0]] || { [matchFields[0]]: 1 };
  }

  async analyzePipelinePerformance(collection, pipeline) {
    console.log('Analyzing pipeline performance...');

    try {
      // Execute explain to get execution statistics
      const explainResult = await collection.aggregate(pipeline).explain('executionStats');

      const analysis = {
        totalExecutionTime: this.extractExecutionTime(explainResult),
        stageBreakdown: this.analyzeStagePerformance(explainResult),
        indexUsage: this.analyzeIndexUsage(explainResult),
        memoryUsage: this.estimateMemoryUsage(explainResult),
        recommendations: []
      };

      // Generate optimization recommendations
      analysis.recommendations = this.generatePipelineRecommendations(analysis);

      return analysis;

    } catch (error) {
      console.error('Pipeline analysis failed:', error);
      return {
        error: error.message,
        recommendations: ['Unable to analyze pipeline - check syntax and data availability']
      };
    }
  }

  extractExecutionTime(explainResult) {
    // Extract execution time from explain result
    if (explainResult.stages && explainResult.stages.length > 0) {
      // executionStats are reported on the $cursor stage (normally the first stage)
      const cursorStage = explainResult.stages.find(stage => stage.$cursor);
      return cursorStage?.$cursor?.executionStats?.executionTimeMillis || 0;
    }
    return 0;
  }

  analyzeStagePerformance(explainResult) {
    // Analyze performance of individual pipeline stages
    if (!explainResult.stages) return [];

    return explainResult.stages.map((stage, index) => {
      const stageInfo = {
        stageIndex: index,
        stageType: Object.keys(stage)[0],
        executionTime: 0,
        documentsProcessed: 0,
        documentsOutput: 0
      };

      // Extract stage-specific metrics
      if (stage.$cursor?.executionStats) {
        stageInfo.executionTime = stage.$cursor.executionStats.executionTimeMillis;
        stageInfo.documentsProcessed = stage.$cursor.executionStats.totalDocsExamined;
        stageInfo.documentsOutput = stage.$cursor.executionStats.nReturned;
      }

      return stageInfo;
    });
  }

  analyzeIndexUsage(explainResult) {
    // Analyze index usage patterns
    const indexUsage = {
      indexesUsed: [],
      collectionScans: 0,
      indexScans: 0,
      efficiency: 0
    };

    // Implementation would analyze explain result for index usage
    // This is a simplified version

    return indexUsage;
  }

  estimateMemoryUsage(explainResult) {
    // Estimate memory usage based on pipeline operations
    let estimatedMemory = 0;

    if (explainResult.stages) {
      for (const stage of explainResult.stages) {
        // Estimate memory for different stage types
        const stageType = Object.keys(stage)[0];

        switch (stageType) {
          case '$group':
            estimatedMemory += 10; // MB estimate
            break;
          case '$sort':
            estimatedMemory += 20; // MB estimate
            break;
          case '$lookup':
            estimatedMemory += 15; // MB estimate
            break;
          default:
            estimatedMemory += 2; // MB estimate
        }
      }
    }

    return estimatedMemory;
  }

  generatePipelineRecommendations(analysis) {
    const recommendations = [];

    // High execution time
    if (analysis.totalExecutionTime > this.performanceTargets.maxExecutionTime) {
      recommendations.push({
        type: 'PERFORMANCE_WARNING',
        message: `Pipeline execution time (${analysis.totalExecutionTime}ms) exceeds target`,
        suggestion: 'Consider adding indexes, reducing data volume, or optimizing pipeline stages'
      });
    }

    // High memory usage
    if (analysis.memoryUsage > this.performanceTargets.maxMemoryUsage) {
      recommendations.push({
        type: 'MEMORY_WARNING',
        message: `Estimated memory usage (${analysis.memoryUsage}MB) may cause performance issues`,
        suggestion: 'Enable allowDiskUse option or reduce pipeline complexity'
      });
    }

    // Collection scans detected
    if (analysis.indexUsage.collectionScans > 0) {
      recommendations.push({
        type: 'INDEX_MISSING',
        message: 'Pipeline includes collection scans',
        suggestion: 'Create indexes for fields used in $match stages'
      });
    }

    return recommendations;
  }
}

// Benefits of MongoDB Advanced Aggregation Pipelines:
// - Flexible multi-stage data processing with optimizable pipeline ordering
// - Rich aggregation operators supporting complex calculations and transformations
// - Built-in memory management with disk usage options for large datasets
// - Advanced analytical capabilities including window functions and time-series analysis
// - Efficient handling of nested documents and array operations
// - Comprehensive performance monitoring and optimization recommendations
// - Integration with MongoDB's query optimizer and index system
// - Support for real-time analytics and complex business intelligence queries
// - Scalable architecture that works across replica sets and sharded clusters
// - SQL-familiar aggregation patterns through QueryLeaf integration

module.exports = {
  MongoAggregationOptimizer
};
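
As a rough usage sketch, the optimizer above can be wired into an application as follows. The connection string, database name, module path, and the assumption that the constructor takes a db handle are illustrative, not part of the original class:

// Hypothetical usage of MongoAggregationOptimizer (connection details and
// constructor signature are assumptions)
const { MongoClient } = require('mongodb');
const { MongoAggregationOptimizer } = require('./mongo-aggregation-optimizer');

async function runOptimizedAggregation() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  try {
    const db = client.db('ecommerce_analytics');
    const optimizer = new MongoAggregationOptimizer(db);

    const pipeline = [
      { $match: { status: 'completed', createdAt: { $gte: new Date('2024-01-01') } } },
      { $unwind: '$items' },
      {
        $group: {
          _id: '$items.productId',
          revenue: { $sum: { $multiply: ['$items.quantity', '$items.unitPrice'] } }
        }
      },
      { $sort: { revenue: -1 } },
      { $limit: 20 }
    ];

    // Index hints are disabled here because they require the suggested index to exist
    const { optimizedPipeline, aggregationOptions, optimizations } =
      await optimizer.optimizePipelinePerformance(pipeline, { enableIndexHints: false });

    console.log('Applied optimizations:', optimizations);

    const results = await db.collection('orders')
      .aggregate(optimizedPipeline, aggregationOptions)
      .toArray();
    console.log(`Top products returned: ${results.length}`);
  } finally {
    await client.close();
  }
}

runOptimizedAggregation().catch(console.error);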

Understanding MongoDB Aggregation Architecture

Advanced Pipeline Design Patterns and Optimization Strategies

Implement sophisticated aggregation patterns for optimal performance and analytical capabilities:

// Advanced aggregation patterns for specialized analytical use cases
class AdvancedAggregationPatterns {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.pipelineTemplates = new Map();
  }

  async implementRealTimeAnalytics() {
    console.log('Implementing real-time analytics aggregation patterns...');

    // Real-time dashboard metrics with incremental processing
    const realTimeDashboardPipeline = [
      // Stage 1: Match recent data only (last 24 hours)
      {
        $match: {
          createdAt: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) },
          status: { $in: ['completed', 'processing'] }
        }
      },

      // Stage 2: Fast aggregation for key metrics
      {
        $facet: {
          // Revenue metrics
          revenueMetrics: [
            {
              $group: {
                _id: null,
                totalRevenue: { $sum: '$totalAmount' },
                orderCount: { $sum: 1 },
                avgOrderValue: { $avg: '$totalAmount' },
                maxOrderValue: { $max: '$totalAmount' }
              }
            }
          ],

          // Hourly breakdown
          hourlyBreakdown: [
            {
              $group: {
                _id: { hour: { $hour: '$createdAt' } },
                revenue: { $sum: '$totalAmount' },
                orders: { $sum: 1 }
              }
            },
            { $sort: { '_id.hour': 1 } }
          ],

          // Top products (by revenue)
          topProducts: [
            { $unwind: '$items' },
            {
              $group: {
                _id: '$items.productId',
                revenue: { $sum: { $multiply: ['$items.quantity', '$items.unitPrice'] } },
                quantity: { $sum: '$items.quantity' }
              }
            },
            { $sort: { revenue: -1 } },
            { $limit: 10 }
          ],

          // Geographic distribution
          geoDistribution: [
            {
              $group: {
                _id: '$shippingAddress.country',
                orders: { $sum: 1 },
                revenue: { $sum: '$totalAmount' }
              }
            },
            { $sort: { revenue: -1 } },
            { $limit: 20 }
          ],

          // Customer segments
          customerSegments: [
            {
              $group: {
                _id: {
                  segment: {
                    $switch: {
                      branches: [
                        { case: { $gte: ['$totalAmount', 500] }, then: 'premium' },
                        { case: { $gte: ['$totalAmount', 100] }, then: 'standard' }
                      ],
                      default: 'basic'
                    }
                  }
                },
                count: { $sum: 1 },
                revenue: { $sum: '$totalAmount' }
              }
            }
          ]
        }
      }
    ];

    const realTimeResults = await this.db.collection('orders').aggregate(
      realTimeDashboardPipeline,
      { maxTimeMS: 1000 } // 1 second timeout for real-time
    ).toArray();

    console.log('Real-time analytics completed');
    return realTimeResults[0];
  }

  async implementCustomerLifecycleAnalysis() {
    console.log('Building customer lifecycle analysis pipeline...');

    const lifecyclePipeline = [
      // Stage 1: Get all customers with their order history
      {
        $lookup: {
          from: 'orders',
          localField: '_id',
          foreignField: 'userId',
          as: 'orders',
          // Make the user's registration date available inside the sub-pipeline
          let: { registrationDate: '$createdAt' },
          pipeline: [
            { $match: { status: 'completed' } },
            { $sort: { createdAt: 1 } },
            {
              $project: {
                createdAt: 1,
                totalAmount: 1,
                daysSinceRegistration: {
                  $divide: [
                    { $subtract: ['$createdAt', '$$registrationDate'] },
                    24 * 60 * 60 * 1000
                  ]
                }
              }
            }
          ]
        }
      },

      // Stage 2: Calculate lifecycle metrics
      {
        $addFields: {
          // Basic lifecycle metrics
          totalOrders: { $size: '$orders' },
          totalSpent: { $sum: '$orders.totalAmount' },
          avgOrderValue: { $avg: '$orders.totalAmount' },

          // Timing analysis
          firstOrderDate: { $min: '$orders.createdAt' },
          lastOrderDate: { $max: '$orders.createdAt' },
          customerLifespanDays: {
            $divide: [
              { $subtract: [{ $max: '$orders.createdAt' }, { $min: '$orders.createdAt' }] },
              24 * 60 * 60 * 1000
            ]
          },

          // Purchase intervals
          orderIntervals: {
            $map: {
              input: { $range: [1, { $size: '$orders' }] },
              as: 'idx',
              in: {
                $divide: [
                  {
                    $subtract: [
                      { $arrayElemAt: ['$orders.createdAt', '$$idx'] },
                      { $arrayElemAt: ['$orders.createdAt', { $subtract: ['$$idx', 1] }] }
                    ]
                  },
                  24 * 60 * 60 * 1000 // Convert to days
                ]
              }
            }
          }
        }
      },

      // CLV calculation in a separate stage: $addFields expressions cannot
      // reference fields (like orderIntervals) defined in the same stage
      {
        $addFields: {
          // CLV calculation (simplified)
          estimatedCLV: {
            $multiply: [
              { $avg: '$orders.totalAmount' }, // Average order value
              { $size: '$orders' }, // Order frequency
              {
                $cond: {
                  if: { $gt: [{ $size: '$orders' }, 1] },
                  then: {
                    $divide: [
                      365, // Days in year
                      { $avg: '$orderIntervals' } // Average days between orders
                    ]
                  },
                  else: 1
                }
              }
            ]
          }
        }
      },

      // Stage 3: Lifecycle stage classification
      {
        $addFields: {
          lifecycleStage: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$totalOrders', 1] },
                  then: 'new_customer'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$totalOrders', 2] },
                      { $lte: ['$totalOrders', 5] },
                      {
                        $gte: [
                          '$lastOrderDate',
                          new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
                        ]
                      }
                    ]
                  },
                  then: 'developing'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$totalOrders', 5] },
                      { $gte: ['$totalSpent', 500] },
                      {
                        $gte: [
                          '$lastOrderDate',
                          new Date(Date.now() - 180 * 24 * 60 * 60 * 1000)
                        ]
                      }
                    ]
                  },
                  then: 'loyal'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$totalOrders', 10] },
                      { $gte: ['$totalSpent', 2000] }
                    ]
                  },
                  then: 'champion'
                },
                {
                  case: {
                    $lt: [
                      '$lastOrderDate',
                      new Date(Date.now() - 365 * 24 * 60 * 60 * 1000)
                    ]
                  },
                  then: 'dormant'
                }
              ],
              default: 'at_risk'
            }
          },

          // Churn risk scoring
          churnRisk: {
            $let: {
              vars: {
                daysSinceLastOrder: {
                  $divide: [
                    { $subtract: [new Date(), '$lastOrderDate'] },
                    24 * 60 * 60 * 1000
                  ]
                },
                avgInterval: { $avg: '$orderIntervals' }
              },
              in: {
                $switch: {
                  branches: [
                    {
                      case: { $gt: ['$$daysSinceLastOrder', { $multiply: ['$$avgInterval', 3] }] },
                      then: 'high'
                    },
                    {
                      case: { $gt: ['$$daysSinceLastOrder', { $multiply: ['$$avgInterval', 2] }] },
                      then: 'medium'
                    }
                  ],
                  default: 'low'
                }
              }
            }
          }
        }
      },

      // Stage 4: Group by lifecycle stage for analysis
      {
        $group: {
          _id: {
            lifecycleStage: '$lifecycleStage',
            churnRisk: '$churnRisk'
          },

          customerCount: { $sum: 1 },
          totalRevenue: { $sum: '$totalSpent' },
          avgCLV: { $avg: '$estimatedCLV' },
          avgLifespan: { $avg: '$customerLifespanDays' },
          avgOrderFrequency: { $avg: { $avg: '$orderIntervals' } },

          // Statistical measures ($bucket is a pipeline stage, not an expression,
          // so CLV values are bucketed here with $switch)
          clvDistribution: {
            $push: {
              $switch: {
                branches: [
                  { case: { $lt: ['$estimatedCLV', 100] }, then: '0-100' },
                  { case: { $lt: ['$estimatedCLV', 500] }, then: '100-500' },
                  { case: { $lt: ['$estimatedCLV', 1000] }, then: '500-1000' },
                  { case: { $lt: ['$estimatedCLV', 5000] }, then: '1000-5000' },
                  { case: { $lt: ['$estimatedCLV', 10000] }, then: '5000-10000' }
                ],
                default: 'high_value'
              }
            }
          }
        }
      },

      {
        $sort: { totalRevenue: -1 }
      }
    ];

    console.log('Executing customer lifecycle analysis...');
    const results = await this.db.collection('users').aggregate(lifecyclePipeline, {
      allowDiskUse: true,
      maxTimeMS: 30000
    }).toArray();

    return results;
  }

  async implementAdvancedTextAnalysis() {
    console.log('Building advanced text analysis pipeline...');

    // Advanced text analysis for reviews and feedback
    const textAnalysisPipeline = [
      // Stage 1: Match published reviews
      {
        $match: {
          status: 'published',
          reviewText: { $exists: true, $ne: '' }
        }
      },

      // Stage 2: Text processing and sentiment analysis
      {
        $addFields: {
          // Text metrics
          wordCount: {
            $size: {
              $split: [{ $trim: { input: '$reviewText' } }, ' ']
            }
          },

          // Sentiment indicators (simplified keyword approach)
          positiveWords: {
            $size: {
              $filter: {
                input: {
                  $split: [
                    { $toLower: '$reviewText' },
                    ' '
                  ]
                },
                cond: {
                  $in: [
                    '$$this',
                    ['excellent', 'great', 'amazing', 'love', 'perfect', 'awesome', 'fantastic', 'wonderful', 'outstanding', 'superb']
                  ]
                }
              }
            }
          },

          negativeWords: {
            $size: {
              $filter: {
                input: {
                  $split: [
                    { $toLower: '$reviewText' },
                    ' '
                  ]
                },
                cond: {
                  $in: [
                    '$$this',
                    ['terrible', 'awful', 'bad', 'horrible', 'worst', 'hate', 'disappointing', 'useless', 'broken', 'defective']
                  ]
                }
              }
            }
          },

          // Quality indicators
          qualityKeywords: {
            $size: {
              $filter: {
                input: {
                  $split: [
                    { $toLower: '$reviewText' },
                    ' '
                  ]
                },
                cond: {
                  $in: [
                    '$$this',
                    ['quality', 'durable', 'sturdy', 'well-made', 'premium', 'solid', 'reliable', 'long-lasting']
                  ]
                }
              }
            }
          },

          // Service indicators
          serviceKeywords: {
            $size: {
              $filter: {
                input: {
                  $split: [
                    { $toLower: '$reviewText' },
                    ' '
                  ]
                },
                cond: {
                  $in: [
                    '$$this',
                    ['service', 'support', 'shipping', 'delivery', 'customer', 'help', 'staff', 'team']
                  ]
                }
              }
            }
          }
        }
      },

      // Stage 3: Sentiment scoring
      {
        $addFields: {
          sentimentScore: {
            $subtract: ['$positiveWords', '$negativeWords']
          },

          sentimentCategory: {
            $switch: {
              branches: [
                {
                  case: { $gte: [{ $subtract: ['$positiveWords', '$negativeWords'] }, 2] },
                  then: 'very_positive'
                },
                {
                  case: { $gte: [{ $subtract: ['$positiveWords', '$negativeWords'] }, 1] },
                  then: 'positive'
                },
                {
                  case: { $lte: [{ $subtract: ['$positiveWords', '$negativeWords'] }, -2] },
                  then: 'very_negative'
                },
                {
                  case: { $lte: [{ $subtract: ['$positiveWords', '$negativeWords'] }, -1] },
                  then: 'negative'
                }
              ],
              default: 'neutral'
            }
          },

          reviewQuality: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gte: ['$wordCount', 50] },
                      { $gte: ['$rating', 4] },
                      { $gte: ['$helpfulVotes', 3] }
                    ]
                  },
                  then: 'high_quality'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$wordCount', 20] },
                      { $or: [{ $gte: ['$rating', 4] }, { $lte: ['$rating', 2] }] }
                    ]
                  },
                  then: 'moderate_quality'
                }
              ],
              default: 'low_quality'
            }
          }
        }
      },

      // Stage 4: Group by product for analysis
      {
        $group: {
          _id: '$productId',

          // Review volume metrics
          totalReviews: { $sum: 1 },
          avgRating: { $avg: '$rating' },
          ratingDistribution: {
            $push: {
              rating: '$rating',
              sentiment: '$sentimentCategory'
            }
          },

          // Text analysis metrics
          avgWordCount: { $avg: '$wordCount' },
          avgSentimentScore: { $avg: '$sentimentScore' },

          // Sentiment distribution
          veryPositive: {
            $sum: { $cond: [{ $eq: ['$sentimentCategory', 'very_positive'] }, 1, 0] }
          },
          positive: {
            $sum: { $cond: [{ $eq: ['$sentimentCategory', 'positive'] }, 1, 0] }
          },
          neutral: {
            $sum: { $cond: [{ $eq: ['$sentimentCategory', 'neutral'] }, 1, 0] }
          },
          negative: {
            $sum: { $cond: [{ $eq: ['$sentimentCategory', 'negative'] }, 1, 0] }
          },
          veryNegative: {
            $sum: { $cond: [{ $eq: ['$sentimentCategory', 'very_negative'] }, 1, 0] }
          },

          // Quality and service mentions
          qualityMentions: { $sum: '$qualityKeywords' },
          serviceMentions: { $sum: '$serviceKeywords' },

          // Review quality distribution
          highQualityReviews: {
            $sum: { $cond: [{ $eq: ['$reviewQuality', 'high_quality'] }, 1, 0] }
          },

          // Most helpful reviews
          topReviews: {
            $push: {
              $cond: {
                if: { $gte: ['$helpfulVotes', 5] },
                then: {
                  reviewId: '$_id',
                  rating: '$rating',
                  sentiment: '$sentimentCategory',
                  helpfulVotes: '$helpfulVotes',
                  wordCount: '$wordCount'
                },
                else: '$$REMOVE'
              }
            }
          }
        }
      },

      // Stage 5: Calculate comprehensive text metrics
      {
        $addFields: {
          // Overall sentiment ratio
          positiveRatio: {
            $divide: [
              { $add: ['$veryPositive', '$positive'] },
              '$totalReviews'
            ]
          },

          negativeRatio: {
            $divide: [
              { $add: ['$negative', '$veryNegative'] },
              '$totalReviews'
            ]
          },

          // Quality score
          qualityScore: {
            $add: [
              // Rating component (40%)
              { $multiply: [{ $divide: ['$avgRating', 5] }, 40] },

              // Sentiment component (30%)
              { $multiply: [{ $divide: [{ $add: ['$veryPositive', '$positive'] }, '$totalReviews'] }, 30] },

              // Review depth component (20%)
              { $multiply: [{ $min: [{ $divide: ['$avgWordCount', 100] }, 1] }, 20] },

              // Quality mentions component (10%)
              { $multiply: [{ $min: [{ $divide: ['$qualityMentions', '$totalReviews'] }, 1] }, 10] }
            ]
          },

          // Text analysis insights
          textInsights: {
            dominantSentiment: {
              $switch: {
                branches: [
                  { case: { $gte: ['$veryPositive', { $max: ['$positive', '$neutral', '$negative', '$veryNegative'] }] }, then: 'very_positive' },
                  { case: { $gte: ['$positive', { $max: ['$neutral', '$negative', '$veryNegative'] }] }, then: 'positive' },
                  { case: { $gte: ['$neutral', { $max: ['$negative', '$veryNegative'] }] }, then: 'neutral' },
                  { case: { $gte: ['$negative', '$veryNegative'] }, then: 'negative' }
                ],
                default: 'very_negative'
              }
            },

            reviewEngagement: {
              $divide: ['$highQualityReviews', '$totalReviews']
            },

            serviceAttention: {
              $divide: ['$serviceMentions', '$totalReviews']
            }
          }
        }
      },

      // Stage 6: Sort by quality score
      {
        $sort: { qualityScore: -1 }
      },

      // Stage 7: Lookup product information
      {
        $lookup: {
          from: 'products',
          localField: '_id',
          foreignField: '_id',
          as: 'product',
          pipeline: [
            {
              $project: {
                name: 1,
                category: 1,
                brand: 1,
                price: 1
              }
            }
          ]
        }
      },

      {
        $addFields: {
          productInfo: { $arrayElemAt: ['$product', 0] }
        }
      },

      {
        $project: {
          product: 0 // Remove array field
        }
      }
    ];

    console.log('Executing advanced text analysis...');
    const results = await this.db.collection('reviews').aggregate(textAnalysisPipeline, {
      allowDiskUse: true,
      maxTimeMS: 45000
    }).toArray();

    return results;
  }

  async monitorPipelinePerformance() {
    console.log('Monitoring aggregation pipeline performance...');

    const performanceMetrics = {
      collections: {},
      systemMetrics: {},
      recommendations: []
    };

    // Analyze recent aggregation operations
    try {
      const recentAggregations = await this.db.collection('system.profile').find({
        ts: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) },
        'command.aggregate': { $exists: true },
        millis: { $gte: 1000 } // Operations taking more than 1 second
      }).sort({ millis: -1 }).limit(20).toArray();

      for (const aggOp of recentAggregations) {
        const analysis = {
          collection: aggOp.command.aggregate,
          duration: aggOp.millis,
          stages: aggOp.command.pipeline ? aggOp.command.pipeline.length : 0,
          allowDiskUse: aggOp.command.allowDiskUse || false,
          timestamp: aggOp.ts
        };

        if (!performanceMetrics.collections[analysis.collection]) {
          performanceMetrics.collections[analysis.collection] = {
            operations: [],
            avgDuration: 0,
            slowOperations: 0
          };
        }

        performanceMetrics.collections[analysis.collection].operations.push(analysis);

        if (analysis.duration > 5000) {
          performanceMetrics.collections[analysis.collection].slowOperations++;
        }
      }

      // Calculate averages and generate recommendations
      for (const [collection, metrics] of Object.entries(performanceMetrics.collections)) {
        const operations = metrics.operations;
        metrics.avgDuration = operations.reduce((sum, op) => sum + op.duration, 0) / operations.length;

        if (metrics.avgDuration > 10000) {
          performanceMetrics.recommendations.push({
            type: 'PERFORMANCE_WARNING',
            collection: collection,
            message: `Average aggregation duration (${metrics.avgDuration}ms) is high`,
            suggestions: [
              'Review pipeline stage ordering',
              'Add appropriate indexes',
              'Enable allowDiskUse for large datasets',
              'Consider data preprocessing'
            ]
          });
        }

        if (metrics.slowOperations > operations.length * 0.5) {
          performanceMetrics.recommendations.push({
            type: 'FREQUENT_SLOW_OPERATIONS',
            collection: collection,
            message: `${metrics.slowOperations} of ${operations.length} operations are slow`,
            suggestions: [
              'Optimize pipeline stages',
              'Review data volume and filtering',
              'Consider aggregation result caching'
            ]
          });
        }
      }

    } catch (error) {
      console.warn('Could not analyze aggregation performance:', error.message);
    }

    return performanceMetrics;
  }
}
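
A possible way to wire these patterns into a recurring dashboard refresh is sketched below. The refresh intervals, database name, and error-handling policy are assumptions, and monitorPipelinePerformance() only returns useful data if profiling is enabled on the database:

// Hypothetical dashboard refresh loop built on AdvancedAggregationPatterns
const { MongoClient } = require('mongodb');

async function startDashboardRefresh() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const patterns = new AdvancedAggregationPatterns(client.db('ecommerce_analytics'));

  // Refresh real-time metrics every 30 seconds (the pipeline itself is capped at 1s)
  setInterval(async () => {
    try {
      const dashboard = await patterns.implementRealTimeAnalytics();
      console.log('Revenue metrics:', dashboard.revenueMetrics[0]);
      console.log('Top products:', dashboard.topProducts.slice(0, 3));
    } catch (error) {
      console.error('Dashboard refresh failed:', error.message);
    }
  }, 30 * 1000);

  // Review aggregation performance hourly using the profiler-backed monitor
  setInterval(async () => {
    try {
      const report = await patterns.monitorPipelinePerformance();
      for (const recommendation of report.recommendations) {
        console.warn(`[${recommendation.type}] ${recommendation.message}`);
      }
    } catch (error) {
      console.error('Performance review failed:', error.message);
    }
  }, 60 * 60 * 1000);
}

startDashboardRefresh().catch(console.error);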

SQL-Style Aggregation with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB aggregation operations:

-- QueryLeaf aggregation with SQL-familiar syntax

-- Complex analytical query with multiple aggregation levels
WITH customer_analytics AS (
  SELECT 
    u.country,
    u.registration_year,
    u.status,

    -- Customer metrics
    COUNT(*) as customer_count,
    AVG(u.total_spent) as avg_customer_value,
    SUM(u.total_spent) as total_revenue,

    -- Customer segmentation
    COUNT(CASE WHEN u.total_spent > 1000 THEN 1 END) as high_value_customers,
    COUNT(CASE WHEN u.total_spent BETWEEN 100 AND 1000 THEN 1 END) as medium_value_customers,
    COUNT(CASE WHEN u.total_spent < 100 THEN 1 END) as low_value_customers,

    -- Behavioral metrics
    AVG(u.order_count) as avg_orders_per_customer,
    AVG(DATEDIFF(CURRENT_DATE, u.last_order_date)) as avg_days_since_last_order,

    -- Geographic performance
    COUNT(DISTINCT u.state) as states_served,
    COUNT(DISTINCT u.city) as cities_served,

    -- Temporal analysis
    COUNT(CASE WHEN u.last_login >= CURRENT_DATE - INTERVAL '30 days' THEN 1 END) as active_users,
    COUNT(CASE WHEN u.last_login < CURRENT_DATE - INTERVAL '90 days' THEN 1 END) as inactive_users

  FROM users u
  WHERE u.created_at >= CURRENT_DATE - INTERVAL '2 years'
    AND u.status != 'deleted'
  GROUP BY u.country, u.registration_year, u.status
),

product_performance AS (
  SELECT 
    p.category,
    p.brand,

    -- Product metrics
    COUNT(*) as product_count,
    AVG(p.price) as avg_price,
    SUM(COALESCE(p.total_sales, 0)) as category_sales,

    -- Performance indicators
    AVG(p.rating) as avg_rating,
    COUNT(CASE WHEN p.rating >= 4.0 THEN 1 END) as highly_rated_products,
    COUNT(CASE WHEN p.stock_level < 10 THEN 1 END) as low_stock_products,

    -- Revenue analysis with complex calculations
    SUM(p.price * COALESCE(p.units_sold, 0)) as gross_revenue,
    AVG(p.price * COALESCE(p.units_sold, 0)) as avg_product_revenue,

    -- Market penetration
    COUNT(DISTINCT p.supplier_id) as supplier_diversity,

    -- Product lifecycle analysis
    COUNT(CASE WHEN p.created_at >= CURRENT_DATE - INTERVAL '6 months' THEN 1 END) as new_products,
    COUNT(CASE WHEN p.last_sold < CURRENT_DATE - INTERVAL '3 months' THEN 1 END) as stale_products,

    -- Statistical measures
    STDDEV(p.price) as price_variance,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY p.price) as median_price,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY p.price) as price_p90

  FROM products p
  WHERE p.status = 'active'
  GROUP BY p.category, p.brand
  HAVING COUNT(*) >= 5  -- Categories with sufficient products
),

time_series_analysis AS (
  SELECT 
    DATE_TRUNC('month', o.created_at) as month,
    o.customer_country,

    -- Volume metrics
    COUNT(*) as order_count,
    SUM(o.total_amount) as monthly_revenue,
    COUNT(DISTINCT o.user_id) as unique_customers,

    -- Average metrics
    AVG(o.total_amount) as avg_order_value,
    AVG(JSON_LENGTH(o.items)) as avg_items_per_order,

    -- Growth calculations using window functions
    LAG(SUM(o.total_amount)) OVER (
      PARTITION BY o.customer_country 
      ORDER BY DATE_TRUNC('month', o.created_at)
    ) as prev_month_revenue,

    LAG(SUM(o.total_amount), 12) OVER (
      PARTITION BY o.customer_country 
      ORDER BY DATE_TRUNC('month', o.created_at)
    ) as same_month_last_year,

    -- Cumulative metrics
    SUM(SUM(o.total_amount)) OVER (
      PARTITION BY o.customer_country 
      ORDER BY DATE_TRUNC('month', o.created_at)
      ROWS UNBOUNDED PRECEDING
    ) as cumulative_revenue,

    -- Moving averages
    AVG(SUM(o.total_amount)) OVER (
      PARTITION BY o.customer_country 
      ORDER BY DATE_TRUNC('month', o.created_at)
      ROWS 2 PRECEDING
    ) as three_month_avg_revenue,

    -- Rankings
    RANK() OVER (
      PARTITION BY DATE_TRUNC('month', o.created_at)
      ORDER BY SUM(o.total_amount) DESC
    ) as monthly_country_rank

  FROM orders o
  WHERE o.status = 'completed'
    AND o.created_at >= CURRENT_DATE - INTERVAL '18 months'
  GROUP BY DATE_TRUNC('month', o.created_at), o.customer_country
),

advanced_text_analysis AS (
  SELECT 
    r.product_id,
    p.category,

    -- Review volume and ratings
    COUNT(*) as review_count,
    AVG(r.rating) as avg_rating,

    -- Sentiment analysis using text functions
    COUNT(CASE 
      WHEN LOWER(r.review_text) SIMILAR TO '%(excellent|great|amazing|love|perfect)%' 
      THEN 1 
    END) as positive_reviews,

    COUNT(CASE 
      WHEN LOWER(r.review_text) SIMILAR TO '%(terrible|awful|bad|horrible|hate)%' 
      THEN 1 
    END) as negative_reviews,

    -- Text quality metrics
    AVG(LENGTH(r.review_text)) as avg_review_length,
    COUNT(CASE WHEN LENGTH(r.review_text) > 100 THEN 1 END) as detailed_reviews,

    -- Helpfulness metrics
    AVG(r.helpful_votes) as avg_helpfulness,
    COUNT(CASE WHEN r.helpful_votes >= 5 THEN 1 END) as highly_helpful_reviews,

    -- Topic analysis using keyword matching
    COUNT(CASE 
      WHEN LOWER(r.review_text) SIMILAR TO '%(quality|durable|sturdy|well-made)%' 
      THEN 1 
    END) as quality_mentions,

    COUNT(CASE 
      WHEN LOWER(r.review_text) SIMILAR TO '%(shipping|delivery|fast|quick)%' 
      THEN 1 
    END) as shipping_mentions,

    -- Rating distribution analysis
    JSON_OBJECT(
      'rating_5', COUNT(CASE WHEN r.rating = 5 THEN 1 END),
      'rating_4', COUNT(CASE WHEN r.rating = 4 THEN 1 END),
      'rating_3', COUNT(CASE WHEN r.rating = 3 THEN 1 END),
      'rating_2', COUNT(CASE WHEN r.rating = 2 THEN 1 END),
      'rating_1', COUNT(CASE WHEN r.rating = 1 THEN 1 END)
    ) as rating_distribution

  FROM reviews r
  JOIN products p ON r.product_id = p.id
  WHERE r.status = 'published'
    AND r.created_at >= CURRENT_DATE - INTERVAL '1 year'
  GROUP BY r.product_id, p.category
  HAVING COUNT(*) >= 10  -- Products with sufficient reviews
)

-- Final comprehensive analysis combining all CTEs
SELECT 
  ca.country,
  ca.customer_count,
  ca.total_revenue,
  ROUND(ca.avg_customer_value, 2) as avg_customer_ltv,

  -- Customer segmentation percentages
  ROUND((ca.high_value_customers / ca.customer_count::float) * 100, 1) as high_value_pct,
  ROUND((ca.medium_value_customers / ca.customer_count::float) * 100, 1) as medium_value_pct,
  ROUND((ca.low_value_customers / ca.customer_count::float) * 100, 1) as low_value_pct,

  -- Activity metrics
  ROUND((ca.active_users / ca.customer_count::float) * 100, 1) as active_user_pct,
  ROUND(ca.avg_orders_per_customer, 1) as avg_orders_per_customer,

  -- Product ecosystem metrics
  (SELECT COUNT(DISTINCT pp.category) 
   FROM product_performance pp) as total_categories,

  (SELECT AVG(pp.avg_rating) 
   FROM product_performance pp) as overall_product_rating,

  -- Time series insights (latest month data)
  (SELECT tsa.monthly_revenue 
   FROM time_series_analysis tsa 
   WHERE tsa.customer_country = ca.country 
   ORDER BY tsa.month DESC 
   LIMIT 1) as latest_month_revenue,

  -- Growth rate calculation
  (SELECT 
     CASE 
       WHEN tsa.prev_month_revenue > 0 THEN
         ROUND(((tsa.monthly_revenue - tsa.prev_month_revenue) / tsa.prev_month_revenue * 100), 2)
       ELSE NULL
     END
   FROM time_series_analysis tsa 
   WHERE tsa.customer_country = ca.country 
   ORDER BY tsa.month DESC 
   LIMIT 1) as mom_growth_rate,

  -- Year over year growth
  (SELECT 
     CASE 
       WHEN tsa.same_month_last_year > 0 THEN
         ROUND(((tsa.monthly_revenue - tsa.same_month_last_year) / tsa.same_month_last_year * 100), 2)
       ELSE NULL
     END
   FROM time_series_analysis tsa 
   WHERE tsa.customer_country = ca.country 
   ORDER BY tsa.month DESC 
   LIMIT 1) as yoy_growth_rate,

  -- Text sentiment analysis
  (SELECT 
     ROUND(AVG(ata.positive_reviews / ata.review_count::float) * 100, 1)
   FROM advanced_text_analysis ata) as avg_positive_sentiment_pct,

  -- Quality perception
  (SELECT 
     ROUND(AVG(ata.quality_mentions / ata.review_count::float) * 100, 1)
   FROM advanced_text_analysis ata) as quality_mention_pct,

  -- Strategic classification
  CASE 
    WHEN ca.total_revenue > 100000 AND ca.high_value_customers > ca.customer_count * 0.2 THEN 'key_market'
    WHEN ca.total_revenue > 50000 AND ca.active_users > ca.customer_count * 0.6 THEN 'growth_market'
    WHEN ca.inactive_users > ca.customer_count * 0.5 THEN 'retention_focus'
    ELSE 'development_market'
  END as market_classification,

  -- Opportunity scoring
  (ca.total_revenue * 0.4 + 
   ca.customer_count * 10 * 0.3 + 
   ca.active_users * 15 * 0.3) as opportunity_score

FROM customer_analytics ca
WHERE ca.customer_count >= 50  -- Markets with sufficient size
ORDER BY ca.total_revenue DESC, ca.customer_count DESC;

-- Real-time dashboard query with faceted aggregation
SELECT 
  -- Today's metrics
  'today_metrics' as facet,
  JSON_OBJECT(
    'orders', COUNT(CASE WHEN o.created_at >= CURRENT_DATE THEN 1 END),
    'revenue', SUM(CASE WHEN o.created_at >= CURRENT_DATE THEN o.total_amount ELSE 0 END),
    'customers', COUNT(DISTINCT CASE WHEN o.created_at >= CURRENT_DATE THEN o.user_id END),
    'avg_order_value', AVG(CASE WHEN o.created_at >= CURRENT_DATE THEN o.total_amount END)
  ) as metrics
FROM orders o
WHERE o.status = 'completed' 
  AND o.created_at >= CURRENT_DATE - INTERVAL '1 day'

UNION ALL

-- Hourly breakdown for today
SELECT 
  'hourly_breakdown' as facet,
  JSON_OBJECT(
    'data', JSON_ARRAYAGG(
      JSON_OBJECT(
        'hour', EXTRACT(HOUR FROM o.created_at),
        'orders', COUNT(*),
        'revenue', SUM(o.total_amount)
      )
    )
  ) as metrics
FROM orders o
WHERE o.status = 'completed'
  AND o.created_at >= CURRENT_DATE
GROUP BY EXTRACT(HOUR FROM o.created_at)

UNION ALL

-- Top performing products today  
SELECT 
  'top_products' as facet,
  JSON_OBJECT(
    'data', JSON_ARRAYAGG(
      JSON_OBJECT(
        'product_id', oi.product_id,
        'revenue', SUM(oi.quantity * oi.unit_price),
        'units_sold', SUM(oi.quantity)
      )
    )
  ) as metrics
FROM orders o
JOIN JSON_TABLE(o.items, '$[*]' COLUMNS (
  product_id VARCHAR(50) PATH '$.productId',
  quantity INT PATH '$.quantity', 
  unit_price DECIMAL(10,2) PATH '$.unitPrice'
)) oi ON TRUE
WHERE o.status = 'completed'
  AND o.created_at >= CURRENT_DATE
GROUP BY oi.product_id
ORDER BY SUM(oi.quantity * oi.unit_price) DESC
LIMIT 10;

-- QueryLeaf provides comprehensive aggregation capabilities:
-- 1. Complex multi-level aggregations with CTEs and subqueries
-- 2. Advanced window functions for time-series analysis and trends
-- 3. JSON aggregation functions for flexible data processing
-- 4. Text analysis capabilities with pattern matching and sentiment analysis
-- 5. Statistical functions including percentiles and standard deviation
-- 6. Faceted queries for dashboard and real-time analytics
-- 7. Flexible grouping and segmentation with conditional logic
-- 8. Performance optimization with proper indexing hints
-- 9. Real-time metrics calculation with temporal filtering
-- 10. Integration with MongoDB's native aggregation framework optimizations

Best Practices for Aggregation Pipeline Optimization

Pipeline Design Guidelines

Essential principles for optimal MongoDB aggregation performance (a brief sketch applying them follows the list):

  1. Early Filtering: Place $match stages as early as possible to reduce dataset size
  2. Index Utilization: Design pipelines to leverage existing indexes effectively
  3. Memory Management: Use allowDiskUse for large datasets and monitor memory usage
  4. Stage Ordering: Place filtering and reducing stages ($match, $limit) ahead of expensive stages such as $group, $lookup, and $sort
  5. Early Projection: Use $project stages early to shrink the documents flowing through the pipeline
  6. Batch Size Optimization: Configure appropriate cursor batch sizes for large results
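
As a brief illustration of these guidelines, the sketch below shows a monthly revenue report pipeline; the collection, field names, and the index assumed by the hint are illustrative:

// Sketch: the guidelines applied to a monthly revenue report
const reportPipeline = [
  // 1. Early filtering: narrow the working set before any expensive stages
  { $match: { status: 'completed', createdAt: { $gte: new Date('2024-01-01') } } },

  // 5. Early projection: carry only the fields later stages actually need
  { $project: { createdAt: 1, totalAmount: 1, userId: 1 } },

  // 4. Stage ordering: group and sort only after the data has been reduced
  {
    $group: {
      _id: { year: { $year: '$createdAt' }, month: { $month: '$createdAt' } },
      revenue: { $sum: '$totalAmount' },
      customers: { $addToSet: '$userId' }
    }
  },
  { $addFields: { uniqueCustomers: { $size: '$customers' } } },
  { $project: { customers: 0 } },
  { $sort: { '_id.year': 1, '_id.month': 1 } }
];

// 2, 3, 6. Index utilization, memory management, and batch sizing via options.
// The hint assumes an index on { status: 1, createdAt: 1 } already exists.
const reportOptions = {
  allowDiskUse: true,
  maxTimeMS: 10000,
  hint: { status: 1, createdAt: 1 },
  cursor: { batchSize: 500 }
};

// const results = await db.collection('orders').aggregate(reportPipeline, reportOptions).toArray();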

Production Performance Optimization

Optimize MongoDB aggregation pipelines for production workloads (a caching and profiling sketch follows the list):

  1. Performance Monitoring: Implement continuous pipeline performance monitoring
  2. Result Caching: Cache aggregation results for frequently executed pipelines
  3. Incremental Processing: Design incremental aggregation patterns for large datasets
  4. Resource Management: Monitor CPU, memory, and disk usage during aggregation
  5. Query Profiling: Use MongoDB profiler to identify aggregation bottlenecks
  6. Parallel Processing: Leverage sharding and replica sets for parallel aggregation
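
One way to approach result caching (point 2) and query profiling (point 5) is sketched below; the cache TTL, hashing scheme, and slow-operation threshold are assumptions:

// Sketch: in-memory result caching plus profiler setup for aggregation workloads
const crypto = require('crypto');

const aggregationCache = new Map();
const CACHE_TTL_MS = 60 * 1000; // assumed 60-second freshness window

async function cachedAggregate(collection, pipeline, options = {}) {
  // Key the cache by collection name and pipeline shape
  const cacheKey = crypto.createHash('sha1')
    .update(collection.collectionName + JSON.stringify(pipeline))
    .digest('hex');

  const cached = aggregationCache.get(cacheKey);
  if (cached && Date.now() - cached.cachedAt < CACHE_TTL_MS) {
    return cached.results;
  }

  const results = await collection.aggregate(pipeline, options).toArray();
  aggregationCache.set(cacheKey, { results, cachedAt: Date.now() });
  return results;
}

// Capture operations slower than 1 second in system.profile so the
// monitorPipelinePerformance() analysis shown earlier has data to work with
async function enableSlowOpProfiling(db) {
  await db.command({ profile: 1, slowms: 1000 });
}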

Conclusion

MongoDB's advanced aggregation framework provides comprehensive data processing capabilities that eliminate the limitations and complexity of traditional relational database aggregation approaches. The flexible pipeline architecture supports sophisticated analytics, real-time processing, and complex transformations while maintaining optimal performance at scale.

Key MongoDB Aggregation benefits include:

  • Flexible Pipeline Architecture: Multi-stage processing with optimizable stage ordering and memory management
  • Rich Analytical Capabilities: Advanced operators supporting complex calculations, statistical analysis, and data transformations
  • Performance Optimization: Built-in query optimization, index integration, and resource management
  • Real-time Processing: Support for real-time analytics and streaming aggregation operations
  • Scalable Architecture: Pipeline execution across replica sets and sharded clusters
  • SQL-Familiar Interface: QueryLeaf integration providing familiar aggregation syntax and patterns

Whether you're building real-time dashboards, conducting complex business intelligence analysis, or implementing sophisticated data processing workflows, MongoDB's aggregation framework with QueryLeaf's familiar SQL interface provides the foundation for high-performance analytical operations.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB aggregation pipelines while providing SQL-familiar aggregation syntax, window functions, and analytical capabilities. Complex data transformations, statistical analysis, and real-time analytics are seamlessly handled through familiar SQL constructs, making sophisticated data processing both powerful and accessible to SQL-oriented development teams.

The combination of native MongoDB aggregation capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both flexible data processing and familiar analytical patterns, ensuring your applications can handle complex analytical workloads while remaining maintainable and performant as they scale.

MongoDB Query Optimization and Explain Plans: Advanced Performance Analysis for High-Performance Database Operations

Database performance optimization is critical for applications that demand fast response times and efficient resource utilization. Poor query performance can lead to degraded user experience, increased infrastructure costs, and system bottlenecks that become increasingly problematic as data volumes and user loads grow.

MongoDB's sophisticated query optimizer and explain plan system provide comprehensive insights into query execution strategies, enabling developers and database administrators to identify performance bottlenecks, optimize index usage, and fine-tune queries for maximum efficiency. Unlike traditional database systems with limited query analysis tools, MongoDB's explain functionality offers detailed execution statistics, index usage patterns, and optimization recommendations that support both development and production performance tuning.
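
Before looking at how other systems approach this, a minimal example of the explain API shows the kind of detail MongoDB returns; the collection and filter below are illustrative:

// Minimal explain example (illustrative collection and filter)
async function explainOrderQuery(db) {
  const explain = await db.collection('orders')
    .find({ status: 'completed', userId: 'user123' })
    .sort({ createdAt: -1 })
    .limit(20)
    .explain('executionStats'); // or 'queryPlanner' / 'allPlansExecution'

  console.log('Winning plan stage:', explain.queryPlanner.winningPlan.stage);
  console.log('Execution time (ms):', explain.executionStats.executionTimeMillis);
  console.log('Keys examined:', explain.executionStats.totalKeysExamined);
  console.log('Docs examined:', explain.executionStats.totalDocsExamined);
  console.log('Docs returned:', explain.executionStats.nReturned);
}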

The Traditional Query Analysis Challenge

Conventional database systems often provide limited query analysis capabilities that make performance optimization difficult:

-- Traditional PostgreSQL query analysis with limited optimization insights

-- Basic EXPLAIN output with limited actionable information
EXPLAIN ANALYZE
SELECT 
  u.user_id,
  u.email,
  u.first_name,
  u.last_name,
  u.created_at,
  COUNT(o.order_id) as order_count,
  SUM(o.total_amount) as total_spent,
  AVG(o.total_amount) as avg_order_value,
  MAX(o.created_at) as last_order_date
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE u.status = 'active'
  AND u.country IN ('US', 'CA', 'UK')
  AND u.created_at >= '2023-01-01'
  AND (o.status = 'completed' OR o.status IS NULL)
GROUP BY u.user_id, u.email, u.first_name, u.last_name, u.created_at
HAVING COUNT(o.order_id) > 0 OR u.created_at >= '2024-01-01'
ORDER BY total_spent DESC, order_count DESC
LIMIT 100;

-- PostgreSQL EXPLAIN output (simplified representation):
--
-- Limit  (cost=15234.45..15234.70 rows=100 width=64) (actual time=245.123..245.167 rows=100 loops=1)
--   ->  Sort  (cost=15234.45..15489.78 rows=102133 width=64) (actual time=245.121..245.138 rows=100 loops=1)
--         Sort Key: (sum(o.total_amount)) DESC, (count(o.order_id)) DESC  
--         Sort Method: top-N heapsort  Memory: 40kB
--         ->  HashAggregate  (cost=11234.56..12456.89 rows=102133 width=64) (actual time=198.456..223.789 rows=45678 loops=1)
--               Group Key: u.user_id, u.email, u.first_name, u.last_name, u.created_at
--               ->  Hash Left Join  (cost=2345.67..8901.23 rows=345678 width=48) (actual time=12.456..89.123 rows=123456 loops=1)
--                     Hash Cond: (u.user_id = o.user_id)
--                     ->  Bitmap Heap Scan on users u  (cost=234.56..1789.45 rows=12345 width=32) (actual time=3.456..15.789 rows=8901 loops=1)
--                           Recheck Cond: ((status = 'active'::text) AND (country = ANY ('{US,CA,UK}'::text[])) AND (created_at >= '2023-01-01'::date))
--                           Heap Blocks: exact=234
--                           ->  BitmapOr  (cost=234.56..234.56 rows=12345 width=0) (actual time=2.890..2.891 rows=0 loops=1)
--                                 ->  Bitmap Index Scan on idx_users_status  (cost=0.00..78.12 rows=4567 width=0) (actual time=0.890..0.890 rows=3456 loops=1)
--                                       Index Cond: (status = 'active'::text)
--                     ->  Hash  (cost=1890.45..1890.45 rows=17890 width=24) (actual time=8.567..8.567 rows=14567 loops=1)
--                           Buckets: 32768  Batches: 1  Memory Usage: 798kB
--                           ->  Seq Scan on orders o  (cost=0.00..1890.45 rows=17890 width=24) (actual time=0.123..5.456 rows=14567 loops=1)
--                                 Filter: ((status = 'completed'::text) OR (status IS NULL))
--                                 Rows Removed by Filter: 3456
-- Planning Time: 2.456 ms
-- Execution Time: 245.678 ms

-- Problems with traditional PostgreSQL EXPLAIN:
-- 1. Complex output format that's difficult to interpret quickly
-- 2. Limited insights into index selection reasoning and alternatives
-- 3. No built-in recommendations for performance improvements
-- 4. Difficult to compare execution plans across different query variations
-- 5. Limited visibility into buffer usage, I/O patterns, and memory allocation
-- 6. No integration with query optimization recommendations or automated tuning
-- 7. Verbose output that makes it hard to identify key performance bottlenecks
-- 8. Limited historical explain plan tracking and performance trend analysis

-- Alternative PostgreSQL analysis approaches
-- Using pg_stat_statements for query analysis (requires extension)
SELECT 
  query,
  calls,
  total_time,
  mean_time,
  rows,
  100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements 
WHERE query LIKE '%users%orders%'
ORDER BY mean_time DESC
LIMIT 10;

-- Problems with pg_stat_statements:
-- - Requires additional configuration and extensions
-- - Limited detail about specific execution patterns
-- - No real-time optimization recommendations
-- - Difficult correlation between query patterns and index usage
-- - Limited integration with application performance monitoring

-- MySQL approach (even more limited)
EXPLAIN FORMAT=JSON
SELECT u.user_id, u.email, COUNT(o.order_id) as orders
FROM users u 
LEFT JOIN orders o ON u.user_id = o.user_id 
WHERE u.status = 'active'
GROUP BY u.user_id, u.email;

-- MySQL EXPLAIN limitations:
-- {
--   "query_block": {
--     "select_id": 1,
--     "cost_info": {
--       "query_cost": "1234.56"
--     },
--     "grouping_operation": {
--       "using_filesort": false,
--       "nested_loop": [
--         {
--           "table": {
--             "table_name": "u",
--             "access_type": "range",
--             "possible_keys": ["idx_status"],
--             "key": "idx_status",
--             "used_key_parts": ["status"],
--             "key_length": "767",
--             "rows_examined_per_scan": 1000,
--             "rows_produced_per_join": 1000,
--             "cost_info": {
--               "read_cost": "200.00",
--               "eval_cost": "100.00",
--               "prefix_cost": "300.00",
--               "data_read_per_join": "64K"
--             }
--           }
--         }
--       ]
--     }
--   }
-- }

-- MySQL EXPLAIN problems:
-- - Very basic cost model with limited accuracy
-- - No detailed execution statistics or actual vs estimated comparisons
-- - Limited index optimization recommendations  
-- - Basic JSON format that's difficult to analyze programmatically
-- - No integration with performance monitoring or automated optimization
-- - Limited support for complex query patterns and aggregations
-- - Minimal historical performance tracking capabilities

MongoDB provides comprehensive query analysis and optimization tools:

// MongoDB Advanced Query Optimization - comprehensive explain plans and performance analysis
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('ecommerce_analytics');

// Advanced query optimization and explain plan analysis system
class MongoQueryOptimizer {
  constructor(db) {
    this.db = db;
    this.collections = {
      users: db.collection('users'),
      orders: db.collection('orders'),
      products: db.collection('products'),
      analytics: db.collection('analytics')
    };

    // Performance analysis configuration
    this.performanceTargets = {
      maxExecutionTimeMs: 100,
      maxDocsExamined: 10000,
      minIndexHitRate: 0.95,
      maxMemoryUsageMB: 32
    };

    this.optimizationStrategies = new Map();
    this.explainCache = new Map();
  }

  async analyzeQueryPerformance(collection, pipeline, options = {}) {
    console.log('Analyzing query performance with comprehensive explain plans...');

    const {
      verbosity = 'executionStats', // 'queryPlanner', 'executionStats', 'allPlansExecution'
      includeRecommendations = true,
      compareAlternatives = true,
      trackMetrics = true
    } = options;

    // Get the collection reference
    const coll = typeof collection === 'string' ? this.collections[collection] : collection;

    // Execute explain with comprehensive analysis
    const explainResult = await this.performComprehensiveExplain(coll, pipeline, verbosity);

    // Analyze explain plan for optimization opportunities
    const analysis = this.analyzeExplainPlan(explainResult);

    // Generate optimization recommendations
    const recommendations = includeRecommendations ? 
      await this.generateOptimizationRecommendations(coll, pipeline, explainResult, analysis) : [];

    // Compare with alternative query strategies
    const alternatives = compareAlternatives ? 
      await this.generateQueryAlternatives(coll, pipeline, explainResult) : [];

    // Track performance metrics for historical analysis
    if (trackMetrics) {
      await this.recordPerformanceMetrics(coll.collectionName, pipeline, explainResult, analysis);
    }

    const performanceReport = {
      query: {
        collection: coll.collectionName,
        pipeline: pipeline,
        timestamp: new Date()
      },

      execution: {
        totalTimeMs: explainResult.executionStats?.executionTimeMillis || 0,
        totalDocsExamined: explainResult.executionStats?.totalDocsExamined || 0,
        totalDocsReturned: explainResult.executionStats?.nReturned || 0,
        executionSuccess: explainResult.executionStats?.executionSuccess || false,
        indexesUsed: this.extractIndexesUsed(explainResult),
        memoryUsage: this.calculateMemoryUsage(explainResult)
      },

      performance: {
        efficiency: this.calculateQueryEfficiency(explainResult),
        indexHitRate: this.calculateIndexHitRate(explainResult),
        selectivity: this.calculateSelectivity(explainResult),
        performanceGrade: this.assignPerformanceGrade(explainResult),
        bottlenecks: analysis.bottlenecks,
        strengths: analysis.strengths
      },

      optimization: {
        recommendations: recommendations,
        alternatives: alternatives,
        estimatedImprovement: this.estimateOptimizationImpact(recommendations),
        prioritizedActions: this.prioritizeOptimizations(recommendations)
      },

      explainDetails: explainResult
    };

    console.log(`Query analysis completed - Performance Grade: ${performanceReport.performance.performanceGrade}`);
    console.log(`Execution Time: ${performanceReport.execution.totalTimeMs}ms`);
    console.log(`Documents Examined: ${performanceReport.execution.totalDocsExamined}`);
    console.log(`Documents Returned: ${performanceReport.execution.totalDocsReturned}`);
    console.log(`Index Hit Rate: ${(performanceReport.performance.indexHitRate * 100).toFixed(1)}%`);

    return performanceReport;
  }

  async performComprehensiveExplain(collection, pipeline, verbosity) {
    console.log(`Executing explain with verbosity: ${verbosity}`);

    try {
      // Handle different query types
      if (Array.isArray(pipeline)) {
        // Aggregation pipeline
        const cursor = collection.aggregate(pipeline);
        return await cursor.explain(verbosity);
      } else if (typeof pipeline === 'object' && pipeline.find) {
        // Find query
        const cursor = collection.find(pipeline.find, pipeline.options || {});
        if (pipeline.sort) cursor.sort(pipeline.sort);
        if (pipeline.limit) cursor.limit(pipeline.limit);
        if (pipeline.skip) cursor.skip(pipeline.skip);

        return await cursor.explain(verbosity);
      } else {
        // Simple find query
        const cursor = collection.find(pipeline);
        return await cursor.explain(verbosity);
      }
    } catch (error) {
      console.error('Explain execution failed:', error);
      return {
        error: error.message,
        executionSuccess: false,
        executionTimeMillis: 0
      };
    }
  }

  analyzeExplainPlan(explainResult) {
    console.log('Analyzing explain plan for performance insights...');

    const analysis = {
      queryType: this.identifyQueryType(explainResult),
      executionPattern: this.analyzeExecutionPattern(explainResult),
      indexUsage: this.analyzeIndexUsage(explainResult),
      bottlenecks: [],
      strengths: [],
      riskFactors: [],
      optimizationOpportunities: []
    };

    // Identify performance bottlenecks
    analysis.bottlenecks = this.identifyBottlenecks(explainResult);

    // Identify query strengths
    analysis.strengths = this.identifyStrengths(explainResult);

    // Identify risk factors
    analysis.riskFactors = this.identifyRiskFactors(explainResult);

    // Identify optimization opportunities
    analysis.optimizationOpportunities = this.identifyOptimizationOpportunities(explainResult);

    return analysis;
  }

  identifyBottlenecks(explainResult) {
    const bottlenecks = [];
    const stats = explainResult.executionStats;

    if (!stats) return bottlenecks;

    // Collection scan bottleneck
    if (this.hasCollectionScan(explainResult)) {
      bottlenecks.push({
        type: 'COLLECTION_SCAN',
        severity: 'HIGH',
        description: 'Query performs collection scan instead of using index',
        impact: 'High CPU and I/O usage, poor scalability',
        docsExamined: stats.totalDocsExamined
      });
    }

    // Poor index selectivity
    const selectivity = this.calculateSelectivity(explainResult);
    if (selectivity < 0.1) {
      bottlenecks.push({
        type: 'POOR_SELECTIVITY',
        severity: 'MEDIUM',
        description: 'Index selectivity is poor, examining many unnecessary documents',
        impact: 'Increased I/O and processing time',
        selectivity: selectivity,
        docsExamined: stats.totalDocsExamined,
        docsReturned: stats.nReturned
      });
    }

    // High execution time
    if (stats.executionTimeMillis > this.performanceTargets.maxExecutionTimeMs) {
      bottlenecks.push({
        type: 'HIGH_EXECUTION_TIME',
        severity: 'HIGH',
        description: 'Query execution time exceeds performance target',
        impact: 'User experience degradation, resource contention',
        executionTime: stats.executionTimeMillis,
        target: this.performanceTargets.maxExecutionTimeMs
      });
    }

    // Sort without index
    if (this.hasSortWithoutIndex(explainResult)) {
      bottlenecks.push({
        type: 'SORT_WITHOUT_INDEX',
        severity: 'MEDIUM',
        description: 'Sort operation performed in memory without index support',
        impact: 'High memory usage, slower sort performance',
        memoryUsage: this.calculateSortMemoryUsage(explainResult)
      });
    }

    // Large result set without limit
    if (stats.nReturned > 1000 && !this.hasLimit(explainResult)) {
      bottlenecks.push({
        type: 'LARGE_RESULT_SET',
        severity: 'MEDIUM',
        description: 'Query returns a large number of documents without a limit',
        impact: 'High memory usage, network overhead',
        docsReturned: stats.nReturned
      });
    }

    return bottlenecks;
  }

  identifyStrengths(explainResult) {
    const strengths = [];
    const stats = explainResult.executionStats;

    if (!stats) return strengths;

    // Efficient index usage
    if (this.hasEfficientIndexUsage(explainResult)) {
      strengths.push({
        type: 'EFFICIENT_INDEX_USAGE',
        description: 'Query uses indexes efficiently with good selectivity',
        indexesUsed: this.extractIndexesUsed(explainResult),
        selectivity: this.calculateSelectivity(explainResult)
      });
    }

    // Fast execution time
    if (stats.executionTimeMillis < this.performanceTargets.maxExecutionTimeMs * 0.5) {
      strengths.push({
        type: 'FAST_EXECUTION',
        description: 'Query executes well below performance targets',
        executionTime: stats.executionTimeMillis,
        target: this.performanceTargets.maxExecutionTimeMs
      });
    }

    // Covered query
    if (this.isCoveredQuery(explainResult)) {
      strengths.push({
        type: 'COVERED_QUERY',
        description: 'Query is covered entirely by index, no document retrieval needed',
        indexesUsed: this.extractIndexesUsed(explainResult)
      });
    }

    // Good result set size management
    if (stats.nReturned < 100 || this.hasLimit(explainResult)) {
      strengths.push({
        type: 'APPROPRIATE_RESULT_SIZE',
        description: 'Query returns an appropriate number of documents',
        docsReturned: stats.nReturned,
        hasLimit: this.hasLimit(explainResult)
      });
    }

    return strengths;
  }

  async generateOptimizationRecommendations(collection, pipeline, explainResult, analysis) {
    console.log('Generating optimization recommendations...');

    const recommendations = [];

    // Index recommendations based on bottlenecks
    for (const bottleneck of analysis.bottlenecks) {
      switch (bottleneck.type) {
        case 'COLLECTION_SCAN':
          recommendations.push({
            type: 'CREATE_INDEX',
            priority: 'HIGH',
            description: 'Create index to eliminate collection scan',
            action: await this.suggestIndexForQuery(collection, pipeline, explainResult),
            estimatedImprovement: '80-95% reduction in execution time',
            implementation: 'Create compound index on filtered and sorted fields'
          });
          break;

        case 'POOR_SELECTIVITY':
          recommendations.push({
            type: 'IMPROVE_INDEX_SELECTIVITY',
            priority: 'MEDIUM',
            description: 'Improve index selectivity with partial index or compound index',
            action: await this.suggestSelectivityImprovement(collection, pipeline, explainResult),
            estimatedImprovement: '30-60% reduction in documents examined',
            implementation: 'Add partial filter or reorganize compound index field order'
          });
          break;

        case 'SORT_WITHOUT_INDEX':
          recommendations.push({
            type: 'INDEX_FOR_SORT',
            priority: 'MEDIUM',
            description: 'Create or modify index to support sort operation',
            action: await this.suggestSortIndex(collection, pipeline, explainResult),
            estimatedImprovement: '50-80% reduction in memory usage and sort time',
            implementation: 'Include sort fields in compound index following ESR pattern'
          });
          break;

        case 'LARGE_RESULT_SET':
          recommendations.push({
            type: 'LIMIT_RESULT_SET',
            priority: 'LOW',
            description: 'Add pagination or result limiting to reduce memory usage',
            action: 'Add $limit stage or implement pagination',
            estimatedImprovement: 'Reduced memory usage and network overhead',
            implementation: 'Implement cursor-based pagination or reasonable limits'
          });
          break;
      }
    }

    // Query restructuring recommendations
    const structuralRecs = await this.suggestQueryRestructuring(collection, pipeline, explainResult);
    recommendations.push(...structuralRecs);

    // Aggregation pipeline optimization
    if (Array.isArray(pipeline)) {
      const pipelineRecs = await this.suggestPipelineOptimizations(pipeline, explainResult);
      recommendations.push(...pipelineRecs);
    }

    return recommendations;
  }

  async generateQueryAlternatives(collection, pipeline, explainResult) {
    console.log('Generating alternative query strategies...');

    const alternatives = [];

    // Test different index hints
    const indexAlternatives = await this.testIndexAlternatives(collection, pipeline);
    alternatives.push(...indexAlternatives);

    // Test different aggregation pipeline orders
    if (Array.isArray(pipeline)) {
      const pipelineAlternatives = await this.testPipelineAlternatives(collection, pipeline);
      alternatives.push(...pipelineAlternatives);
    }

    // Test query restructuring alternatives
    const structuralAlternatives = await this.testStructuralAlternatives(collection, pipeline);
    alternatives.push(...structuralAlternatives);

    return alternatives;
  }

  async suggestIndexForQuery(collection, pipeline, explainResult) {
    // Analyze query pattern to suggest optimal index
    const queryFields = this.extractQueryFields(pipeline);
    const sortFields = this.extractSortFields(pipeline);

    const indexSuggestion = {
      fields: {},
      options: {}
    };

    // Apply ESR (Equality, Sort, Range) pattern
    const equalityFields = queryFields.equality || [];
    const rangeFields = queryFields.range || [];

    // Add equality fields first
    equalityFields.forEach(field => {
      indexSuggestion.fields[field] = 1;
    });

    // Add sort fields
    if (sortFields) {
      Object.entries(sortFields).forEach(([field, direction]) => {
        indexSuggestion.fields[field] = direction;
      });
    }

    // Add range fields last
    rangeFields.forEach(field => {
      if (!indexSuggestion.fields[field]) {
        indexSuggestion.fields[field] = 1;
      }
    });

    // Suggest partial index if selective filters present
    if (queryFields.selective && queryFields.selective.length > 0) {
      indexSuggestion.options.partialFilterExpression = this.buildPartialFilter(queryFields.selective);
    }

    return {
      indexSpec: indexSuggestion.fields,
      indexOptions: indexSuggestion.options,
      createCommand: `db.${collection.collectionName}.createIndex(${JSON.stringify(indexSuggestion.fields)}, ${JSON.stringify(indexSuggestion.options)})`,
      explanation: this.explainIndexSuggestion(indexSuggestion, queryFields, sortFields)
    };
  }
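
  // Illustrative example (hypothetical collection and fields) of the ESR
  // ordering produced above: for a find filter
  //   { status: 'active', created_at: { $gte: someDate } }
  // sorted by { created_at: -1 }, the equality field comes first and the sort
  // field second, so the suggested index spec is { status: 1, created_at: -1 }
  // and createCommand becomes:
  //   db.orders.createIndex({"status":1,"created_at":-1}, {})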

  calculateQueryEfficiency(explainResult) {
    const stats = explainResult.executionStats;
    if (!stats) return 0;

    const docsExamined = stats.totalDocsExamined || 0;
    const docsReturned = stats.nReturned || 0;

    if (docsExamined === 0) return 1;

    return Math.min(1, docsReturned / docsExamined);
  }

  calculateIndexHitRate(explainResult) {
    if (this.hasCollectionScan(explainResult)) return 0;

    const indexUsage = this.analyzeIndexUsage(explainResult);
    return indexUsage.effectiveness || 0.5;
  }

  calculateSelectivity(explainResult) {
    const stats = explainResult.executionStats;
    if (!stats) return 0;

    const docsExamined = stats.totalDocsExamined || 0;
    const docsReturned = stats.nReturned || 0;

    if (docsExamined === 0) return 1;

    return docsReturned / docsExamined;
  }

  assignPerformanceGrade(explainResult) {
    const efficiency = this.calculateQueryEfficiency(explainResult);
    const indexHitRate = this.calculateIndexHitRate(explainResult);
    const stats = explainResult.executionStats;
    const executionTime = stats?.executionTimeMillis || 0;

    let score = 0;

    // Efficiency scoring (40% weight)
    if (efficiency >= 0.9) score += 40;
    else if (efficiency >= 0.7) score += 30;
    else if (efficiency >= 0.5) score += 20;
    else if (efficiency >= 0.2) score += 10;

    // Index usage scoring (35% weight)
    if (indexHitRate >= 0.95) score += 35;
    else if (indexHitRate >= 0.8) score += 25;
    else if (indexHitRate >= 0.5) score += 15;
    else if (indexHitRate >= 0.2) score += 5;

    // Execution time scoring (25% weight)
    if (executionTime <= 50) score += 25;
    else if (executionTime <= 100) score += 20;
    else if (executionTime <= 250) score += 15;
    else if (executionTime <= 500) score += 10;
    else if (executionTime <= 1000) score += 5;

    // Convert to letter grade
    if (score >= 85) return 'A';
    else if (score >= 75) return 'B';
    else if (score >= 65) return 'C';
    else if (score >= 50) return 'D';
    else return 'F';
  }
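
  // Worked example of the weighted scoring above (hypothetical values): a
  // query with efficiency 0.95 (40 pts), index hit rate 0.90 (25 pts), and a
  // 40ms execution time (25 pts) scores 90 and is graded 'A'; a collection
  // scan with efficiency 0.05, hit rate 0, and 800ms execution scores 5 ('F')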

  // Helper methods for detailed analysis

  hasCollectionScan(explainResult) {
    return this.findStageInPlan(explainResult, 'COLLSCAN') !== null;
  }

  hasSortWithoutIndex(explainResult) {
    // A blocking SORT stage only appears when the sort order cannot be
    // satisfied by an index, so its presence indicates an in-memory sort
    return this.findStageInPlan(explainResult, 'SORT') !== null;
  }

  hasLimit(explainResult) {
    return this.findStageInPlan(explainResult, 'LIMIT') !== null;
  }

  isCoveredQuery(explainResult) {
    // Check if query is covered by examining projection and index keys
    const projectionStage = this.findStageInPlan(explainResult, 'PROJECTION_COVERED');
    return projectionStage !== null;
  }

  hasEfficientIndexUsage(explainResult) {
    const selectivity = this.calculateSelectivity(explainResult);
    const indexHitRate = this.calculateIndexHitRate(explainResult);
    return selectivity > 0.1 && indexHitRate > 0.8;
  }

  findStageInPlan(explainResult, stageName) {
    // Recursively search through execution plan for specific stage
    const searchStage = (stage) => {
      if (!stage) return null;

      if (stage.stage === stageName) return stage;

      if (stage.inputStage) {
        const result = searchStage(stage.inputStage);
        if (result) return result;
      }

      if (stage.inputStages) {
        for (const inputStage of stage.inputStages) {
          const result = searchStage(inputStage);
          if (result) return result;
        }
      }

      return null;
    };

    const executionStats = explainResult.executionStats;
    if (executionStats?.executionStages) {
      return searchStage(executionStats.executionStages);
    }

    return null;
  }

  extractIndexesUsed(explainResult) {
    const indexes = new Set();

    const findIndexes = (stage) => {
      if (!stage) return;

      if (stage.indexName) {
        indexes.add(stage.indexName);
      }

      if (stage.inputStage) {
        findIndexes(stage.inputStage);
      }

      if (stage.inputStages) {
        stage.inputStages.forEach(inputStage => findIndexes(inputStage));
      }
    };

    const executionStats = explainResult.executionStats;
    if (executionStats?.executionStages) {
      findIndexes(executionStats.executionStages);
    }

    return Array.from(indexes);
  }

  extractQueryFields(pipeline) {
    // Extract fields used in query conditions
    const fields = {
      equality: [],
      range: [],
      selective: []
    };

    if (Array.isArray(pipeline)) {
      // Aggregation pipeline
      pipeline.forEach(stage => {
        if (stage.$match) {
          this.extractFieldsFromMatch(stage.$match, fields);
        }
      });
    } else if (typeof pipeline === 'object') {
      // Find query
      if (pipeline.find) {
        this.extractFieldsFromMatch(pipeline.find, fields);
      } else {
        this.extractFieldsFromMatch(pipeline, fields);
      }
    }

    return fields;
  }

  extractFieldsFromMatch(matchStage, fields) {
    Object.entries(matchStage).forEach(([field, condition]) => {
      if (field.startsWith('$')) return; // Skip operators

      if (typeof condition === 'object' && condition !== null) {
        const operators = Object.keys(condition);
        if (operators.some(op => ['$gt', '$gte', '$lt', '$lte'].includes(op))) {
          fields.range.push(field);
        } else if (operators.includes('$in')) {
          if (condition.$in.length <= 5) {
            fields.selective.push(field);
          } else {
            fields.equality.push(field);
          }
        } else {
          fields.equality.push(field);
        }
      } else {
        fields.equality.push(field);
      }
    });
  }

  extractSortFields(pipeline) {
    if (Array.isArray(pipeline)) {
      for (const stage of pipeline) {
        if (stage.$sort) {
          return stage.$sort;
        }
      }
    } else if (pipeline.sort) {
      return pipeline.sort;
    }

    return null;
  }

  async recordPerformanceMetrics(collectionName, pipeline, explainResult, analysis) {
    try {
      const metrics = {
        timestamp: new Date(),
        collection: collectionName,
        queryHash: this.generateQueryHash(pipeline),
        pipeline: pipeline,

        execution: {
          timeMs: explainResult.executionStats?.executionTimeMillis || 0,
          docsExamined: explainResult.executionStats?.totalDocsExamined || 0,
          docsReturned: explainResult.executionStats?.nReturned || 0,
          indexesUsed: this.extractIndexesUsed(explainResult),
          success: explainResult.executionStats?.executionSuccess !== false
        },

        performance: {
          efficiency: this.calculateQueryEfficiency(explainResult),
          indexHitRate: this.calculateIndexHitRate(explainResult),
          selectivity: this.calculateSelectivity(explainResult),
          grade: this.assignPerformanceGrade(explainResult)
        },

        analysis: {
          bottleneckCount: analysis.bottlenecks.length,
          strengthCount: analysis.strengths.length,
          queryType: analysis.queryType,
          riskLevel: this.calculateRiskLevel(analysis.riskFactors)
        }
      };

      await this.collections.analytics.insertOne(metrics);
    } catch (error) {
      console.warn('Failed to record performance metrics:', error.message);
    }
  }

  generateQueryHash(pipeline) {
    // Generate a consistent hash for query pattern identification; stage and
    // field order are significant, so serialize the structure as-is
    const queryString = JSON.stringify(pipeline);
    return require('crypto').createHash('md5').update(queryString).digest('hex');
  }

  calculateMemoryUsage(explainResult) {
    // Estimate memory usage from explain plan
    let memoryUsage = 0;

    const sortStage = this.findStageInPlan(explainResult, 'SORT');
    if (sortStage) {
      // Rough estimate of in-memory sort cost: assume ~1KB per examined
      // document, expressed in megabytes
      memoryUsage += (explainResult.executionStats?.totalDocsExamined || 0) * 0.001;
    }

    return memoryUsage;
  }

  calculateSortMemoryUsage(explainResult) {
    const stats = explainResult.executionStats;
    if (!stats) return 0;

    // Estimate memory usage for in-memory sort
    const avgDocSize = 1024; // Estimated average document size in bytes
    const docsToSort = stats.totalDocsExamined || 0;

    return (docsToSort * avgDocSize) / (1024 * 1024); // Convert to MB
  }

  async performBatchQueryAnalysis(queries) {
    console.log(`Analyzing batch of ${queries.length} queries...`);

    const results = [];
    const batchMetrics = {
      totalQueries: queries.length,
      analyzedSuccessfully: 0,
      averageExecutionTime: 0,
      averageEfficiency: 0,
      gradeDistribution: { A: 0, B: 0, C: 0, D: 0, F: 0 },
      commonBottlenecks: new Map(),
      recommendationFrequency: new Map()
    };

    for (let i = 0; i < queries.length; i++) {
      const query = queries[i];
      console.log(`Analyzing query ${i + 1}/${queries.length}: ${query.name || 'Unnamed'}`);

      try {
        const analysis = await this.analyzeQueryPerformance(query.collection, query.pipeline, query.options);
        results.push({
          queryIndex: i,
          queryName: query.name || `Query_${i + 1}`,
          analysis: analysis,
          success: true
        });

        // Update batch metrics
        batchMetrics.analyzedSuccessfully++;
        batchMetrics.averageExecutionTime += analysis.execution.totalTimeMs;
        batchMetrics.averageEfficiency += analysis.performance.efficiency;
        batchMetrics.gradeDistribution[analysis.performance.performanceGrade]++;

        // Track common bottlenecks
        analysis.performance.bottlenecks.forEach(bottleneck => {
          const count = batchMetrics.commonBottlenecks.get(bottleneck.type) || 0;
          batchMetrics.commonBottlenecks.set(bottleneck.type, count + 1);
        });

        // Track recommendation frequency
        analysis.optimization.recommendations.forEach(rec => {
          const count = batchMetrics.recommendationFrequency.get(rec.type) || 0;
          batchMetrics.recommendationFrequency.set(rec.type, count + 1);
        });

      } catch (error) {
        console.error(`Query ${i + 1} analysis failed:`, error.message);
        results.push({
          queryIndex: i,
          queryName: query.name || `Query_${i + 1}`,
          error: error.message,
          success: false
        });
      }
    }

    // Calculate final batch metrics
    if (batchMetrics.analyzedSuccessfully > 0) {
      batchMetrics.averageExecutionTime /= batchMetrics.analyzedSuccessfully;
      batchMetrics.averageEfficiency /= batchMetrics.analyzedSuccessfully;
    }

    // Convert Maps to Objects for JSON serialization
    batchMetrics.commonBottlenecks = Object.fromEntries(batchMetrics.commonBottlenecks);
    batchMetrics.recommendationFrequency = Object.fromEntries(batchMetrics.recommendationFrequency);

    console.log(`Batch analysis completed: ${batchMetrics.analyzedSuccessfully}/${batchMetrics.totalQueries} queries analyzed successfully`);
    console.log(`Average execution time: ${batchMetrics.averageExecutionTime.toFixed(2)}ms`);
    console.log(`Average efficiency: ${(batchMetrics.averageEfficiency * 100).toFixed(1)}%`);

    return {
      results: results,
      batchMetrics: batchMetrics,
      summary: {
        totalAnalyzed: batchMetrics.analyzedSuccessfully,
        averagePerformance: batchMetrics.averageEfficiency,
        mostCommonBottleneck: this.getMostCommon(batchMetrics.commonBottlenecks),
        mostCommonRecommendation: this.getMostCommon(batchMetrics.recommendationFrequency),
        performanceDistribution: batchMetrics.gradeDistribution
      }
    };
  }

  getMostCommon(frequency) {
    let maxCount = 0;
    let mostCommon = null;

    Object.entries(frequency).forEach(([key, count]) => {
      if (count > maxCount) {
        maxCount = count;
        mostCommon = key;
      }
    });

    return { type: mostCommon, count: maxCount };
  }

  // Additional helper methods for comprehensive analysis...

  identifyQueryType(explainResult) {
    if (this.findStageInPlan(explainResult, 'GROUP')) return 'aggregation';
    if (this.findStageInPlan(explainResult, 'SORT')) return 'sorted_query';
    if (this.hasLimit(explainResult)) return 'limited_query';
    return 'simple_query';
  }

  analyzeExecutionPattern(explainResult) {
    const pattern = {
      hasIndexScan: this.findStageInPlan(explainResult, 'IXSCAN') !== null,
      hasCollectionScan: this.hasCollectionScan(explainResult),
      hasSort: this.findStageInPlan(explainResult, 'SORT') !== null,
      hasGroup: this.findStageInPlan(explainResult, 'GROUP') !== null,
      hasLimit: this.hasLimit(explainResult)
    };

    return pattern;
  }

  analyzeIndexUsage(explainResult) {
    const indexesUsed = this.extractIndexesUsed(explainResult);
    const hasCollScan = this.hasCollectionScan(explainResult);

    return {
      indexCount: indexesUsed.length,
      indexes: indexesUsed,
      hasCollectionScan: hasCollScan,
      effectiveness: hasCollScan ? 0 : Math.min(1, this.calculateSelectivity(explainResult))
    };
  }

  identifyRiskFactors(explainResult) {
    const risks = [];
    const stats = explainResult.executionStats;

    if (stats?.totalDocsExamined > 100000) {
      risks.push({
        type: 'HIGH_DOCUMENT_EXAMINATION',
        description: 'Query examines a very large number of documents',
        impact: 'Scalability concerns, resource intensive'
      });
    }

    if (this.hasCollectionScan(explainResult)) {
      risks.push({
        type: 'COLLECTION_SCAN_SCALING',
        description: 'Collection scan will degrade with data growth',
        impact: 'Linear performance degradation as data grows'
      });
    }

    return risks;
  }

  identifyOptimizationOpportunities(explainResult) {
    const opportunities = [];

    if (this.hasCollectionScan(explainResult)) {
      opportunities.push({
        type: 'INDEX_CREATION',
        description: 'Create appropriate indexes to eliminate collection scans',
        impact: 'Significant performance improvement'
      });
    }

    if (this.hasSortWithoutIndex(explainResult)) {
      opportunities.push({
        type: 'SORT_OPTIMIZATION',
        description: 'Optimize index to support sort operations',
        impact: 'Reduced memory usage and faster sorting'
      });
    }

    return opportunities;
  }

  calculateRiskLevel(riskFactors) {
    if (riskFactors.length === 0) return 'LOW';
    if (riskFactors.some(r => r.type.includes('HIGH') || r.type.includes('CRITICAL'))) return 'HIGH';
    if (riskFactors.length > 2) return 'MEDIUM';
    return 'LOW';
  }
}

// Benefits of MongoDB Query Optimization and Explain Plans:
// - Comprehensive execution plan analysis with detailed performance metrics
// - Automatic bottleneck identification and optimization recommendations
// - Advanced index usage analysis and index suggestion algorithms
// - Real-time query performance monitoring and historical trending
// - Intelligent query alternative generation and comparative analysis
// - Integration with aggregation pipeline optimization techniques
// - Detailed memory usage analysis and resource consumption tracking
// - Batch query analysis capabilities for application-wide performance review
// - Automated performance grading and risk assessment
// - Production-ready performance monitoring and alerting integration

module.exports = {
  MongoQueryOptimizer
};
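
The optimizer class is typically driven from a maintenance script or an admin endpoint. The following is a minimal usage sketch, not taken from the original implementation: it assumes the MongoQueryOptimizer constructor accepts a connected database handle, that the 'orders' collection is registered in its internal collections map, and that the module is saved as mongo-query-optimizer.js.

const { MongoClient } = require('mongodb');
const { MongoQueryOptimizer } = require('./mongo-query-optimizer'); // hypothetical file name

async function runAnalysis() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  try {
    // Assumed constructor signature: pass the target database handle
    const optimizer = new MongoQueryOptimizer(client.db('ecommerce'));

    // Analyze a representative find-style query against the orders collection
    const report = await optimizer.analyzeQueryPerformance('orders', {
      find: { status: 'completed', created_at: { $gte: new Date('2024-01-01') } },
      sort: { created_at: -1 },
      limit: 100
    }, { includeRecommendations: true, compareAlternatives: false });

    console.log(`Grade: ${report.performance.performanceGrade}`);
    console.log(report.optimization.recommendations);
  } finally {
    await client.close();
  }
}

runAnalysis().catch(console.error);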

Understanding MongoDB Query Optimization Architecture

Advanced Query Analysis Techniques and Performance Tuning

Implement sophisticated query analysis patterns for production optimization:

// Advanced query optimization patterns and performance monitoring
class AdvancedQueryAnalyzer {
  constructor(db) {
    this.db = db;
    this.performanceHistory = new Map();
    this.optimizationRules = new Map();
    this.alertThresholds = {
      executionTimeMs: 1000,
      docsExaminedRatio: 10,
      indexHitRate: 0.8
    };
  }

  async implementRealTimePerformanceMonitoring(collections) {
    console.log('Setting up real-time query performance monitoring...');

    // Enable database profiling for detailed query analysis
    // (at level 2 every operation is profiled; slowms and sampleRate only
    // filter what is captured at level 1, though slowms still governs which
    // operations appear in the diagnostic log)
    await this.db.runCommand({
      profile: 2,   // Profile all operations
      slowms: 100,  // Slow-operation threshold in milliseconds
      sampleRate: 0.1
    });

    // Create performance monitoring aggregation pipeline
    const monitoringPipeline = [
      {
        $match: {
          ts: { $gte: new Date(Date.now() - 60000) }, // Last minute
          ns: { $in: collections.map(col => `${this.db.databaseName}.${col}`) },
          command: { $exists: true }
        }
      },
      {
        $addFields: {
          queryType: {
            $switch: {
              branches: [
                { case: { $ne: ['$command.find', null] }, then: 'find' },
                { case: { $ne: ['$command.aggregate', null] }, then: 'aggregate' },
                { case: { $ne: ['$command.update', null] }, then: 'update' },
                { case: { $ne: ['$command.delete', null] }, then: 'delete' }
              ],
              default: 'other'
            }
          },

          // Extract query shape for pattern analysis
          queryShape: {
            $switch: {
              branches: [
                {
                  case: { $ne: ['$command.find', null] },
                  then: { $objectToArray: { $ifNull: ['$command.filter', {}] } }
                },
                {
                  case: { $ne: ['$command.aggregate', null] },
                  then: { $arrayElemAt: ['$command.pipeline', 0] }
                }
              ],
              default: {}
            }
          },

          // Performance metrics calculation
          efficiency: {
            $cond: {
              if: { $gt: ['$docsExamined', 0] },
              then: { $divide: ['$nreturned', '$docsExamined'] },
              else: 1
            }
          },

          // Index usage assessment
          indexUsed: {
            $cond: {
              if: { $ne: ['$planSummary', null] },
              then: { $not: { $regexMatch: { input: '$planSummary', regex: 'COLLSCAN' } } },
              else: false
            }
          }
        }
      },
      {
        $group: {
          _id: {
            collection: { $arrayElemAt: [{ $split: ['$ns', '.'] }, 1] },
            queryType: '$queryType',
            queryShape: '$queryShape'
          },

          // Aggregated performance metrics
          avgExecutionTime: { $avg: '$millis' },
          maxExecutionTime: { $max: '$millis' },
          totalQueries: { $sum: 1 },
          avgEfficiency: { $avg: '$efficiency' },
          avgDocsExamined: { $avg: '$docsExamined' },
          avgDocsReturned: { $avg: '$nreturned' },
          indexUsageRate: { $avg: { $cond: ['$indexUsed', 1, 0] } },

          // Query examples for further analysis
          sampleQueries: { $push: { command: '$command', millis: '$millis' } }
        }
      },
      {
        $match: {
          $or: [
            { avgExecutionTime: { $gt: this.alertThresholds.executionTimeMs } },
            { avgEfficiency: { $lt: 0.1 } },
            { indexUsageRate: { $lt: this.alertThresholds.indexHitRate } }
          ]
        }
      },
      {
        $sort: { avgExecutionTime: -1 }
      }
    ];

    try {
      const performanceIssues = await this.db.collection('system.profile')
        .aggregate(monitoringPipeline).toArray();

      // Process identified performance issues
      for (const issue of performanceIssues) {
        await this.processPerformanceIssue(issue);
      }

      console.log(`Performance monitoring identified ${performanceIssues.length} potential issues`);
      return performanceIssues;

    } catch (error) {
      console.error('Performance monitoring failed:', error);
      return [];
    }
  }

  async processPerformanceIssue(issue) {
    const issueSignature = this.generateIssueSignature(issue);

    // Check if this issue has been seen before
    if (this.performanceHistory.has(issueSignature)) {
      const history = this.performanceHistory.get(issueSignature);
      history.occurrences++;
      history.lastSeen = new Date();

      // Escalate if recurring issue
      if (history.occurrences > 5) {
        await this.escalatePerformanceIssue(issue, history);
      }
    } else {
      // New issue, add to tracking
      this.performanceHistory.set(issueSignature, {
        firstSeen: new Date(),
        lastSeen: new Date(),
        occurrences: 1,
        issue: issue
      });
    }

    // Generate optimization recommendations
    const recommendations = await this.generateRealtimeRecommendations(issue);

    // Log performance alert
    await this.logPerformanceAlert({
      timestamp: new Date(),
      collection: issue._id.collection,
      queryType: issue._id.queryType,
      severity: this.calculateSeverity(issue),
      metrics: {
        avgExecutionTime: issue.avgExecutionTime,
        avgEfficiency: issue.avgEfficiency,
        indexUsageRate: issue.indexUsageRate,
        totalQueries: issue.totalQueries
      },
      recommendations: recommendations,
      issueSignature: issueSignature
    });
  }

  async generateRealtimeRecommendations(issue) {
    const recommendations = [];

    // Low index usage rate
    if (issue.indexUsageRate < this.alertThresholds.indexHitRate) {
      recommendations.push({
        type: 'INDEX_OPTIMIZATION',
        priority: 'HIGH',
        description: `Collection ${issue._id.collection} has low index usage rate (${(issue.indexUsageRate * 100).toFixed(1)}%)`,
        action: 'Analyze query patterns and create appropriate indexes',
        queryType: issue._id.queryType
      });
    }

    // High execution time
    if (issue.avgExecutionTime > this.alertThresholds.executionTimeMs) {
      recommendations.push({
        type: 'PERFORMANCE_OPTIMIZATION',
        priority: 'HIGH',
        description: `Queries on ${issue._id.collection} averaging ${issue.avgExecutionTime.toFixed(2)}ms execution time`,
        action: 'Review query structure and index strategy',
        queryType: issue._id.queryType
      });
    }

    // Poor efficiency
    if (issue.avgEfficiency < 0.1) {
      recommendations.push({
        type: 'SELECTIVITY_IMPROVEMENT',
        priority: 'MEDIUM',
        description: `Poor query selectivity detected (${(issue.avgEfficiency * 100).toFixed(1)}% efficiency)`,
        action: 'Implement more selective query filters or partial indexes',
        queryType: issue._id.queryType
      });
    }

    return recommendations;
  }

  async performHistoricalPerformanceAnalysis(timeRange = '7d') {
    console.log(`Performing historical performance analysis for ${timeRange}...`);

    const timeRangeMs = this.parseTimeRange(timeRange);
    const startDate = new Date(Date.now() - timeRangeMs);

    const historicalAnalysis = await this.db.collection('system.profile').aggregate([
      {
        $match: {
          ts: { $gte: startDate },
          command: { $exists: true },
          millis: { $exists: true }
        }
      },
      {
        $addFields: {
          hour: { $dateToString: { format: '%Y-%m-%d-%H', date: '$ts' } },
          collection: { $arrayElemAt: [{ $split: ['$ns', '.'] }, 1] },
          queryType: {
            $switch: {
              branches: [
                { case: { $ne: ['$command.find', null] }, then: 'find' },
                { case: { $ne: ['$command.aggregate', null] }, then: 'aggregate' },
                { case: { $ne: ['$command.update', null] }, then: 'update' }
              ],
              default: 'other'
            }
          }
        }
      },
      {
        $group: {
          _id: {
            hour: '$hour',
            collection: '$collection',
            queryType: '$queryType'
          },

          // Time-based metrics
          queryCount: { $sum: 1 },
          avgLatency: { $avg: '$millis' },
          maxLatency: { $max: '$millis' },
          // $percentile accumulator requires MongoDB 7.0 or newer
          p95Latency: {
            $percentile: {
              input: '$millis',
              p: [0.95],
              method: 'approximate'
            }
          },

          // Efficiency metrics
          totalDocsExamined: { $sum: '$docsExamined' },
          totalDocsReturned: { $sum: '$nreturned' },
          avgEfficiency: {
            $avg: {
              $cond: {
                if: { $gt: ['$docsExamined', 0] },
                then: { $divide: ['$nreturned', '$docsExamined'] },
                else: 1
              }
            }
          },

          // Index usage tracking
          collectionScans: {
            $sum: {
              $cond: [
                { $regexMatch: { input: { $ifNull: ['$planSummary', ''] }, regex: 'COLLSCAN' } },
                1,
                0
              ]
            }
          }
        }
      },
      {
        $addFields: {
          indexUsageRate: {
            $subtract: [1, { $divide: ['$collectionScans', '$queryCount'] }]
          },

          // Performance trend calculation (guard against division by zero
          // when an hour's average latency rounds to 0ms)
          performanceScore: {
            $add: [
              { $multiply: [{ $min: [1, { $divide: [1000, { $max: ['$avgLatency', 1] }] }] }, 0.4] },
              { $multiply: ['$avgEfficiency', 0.3] },
              { $multiply: ['$indexUsageRate', 0.3] }
            ]
          }
        }
      },
      {
        $sort: { '_id.hour': 1, performanceScore: 1 }
      }
    ]).toArray();

    // Analyze trends and patterns
    const trendAnalysis = this.analyzePerformanceTrends(historicalAnalysis);
    const recommendations = this.generateHistoricalRecommendations(trendAnalysis);

    return {
      timeRange: timeRange,
      analysis: historicalAnalysis,
      trends: trendAnalysis,
      recommendations: recommendations,
      summary: {
        totalHours: new Set(historicalAnalysis.map(h => h._id.hour)).size,
        collectionsAnalyzed: new Set(historicalAnalysis.map(h => h._id.collection)).size,
        avgPerformanceScore: historicalAnalysis.length > 0 ?
          historicalAnalysis.reduce((sum, h) => sum + h.performanceScore, 0) / historicalAnalysis.length : 0,
        // Results are sorted by hour first, so pick worst/best explicitly by score
        worstPerformingHour: historicalAnalysis.reduce((worst, h) =>
          !worst || h.performanceScore < worst.performanceScore ? h : worst, null),
        bestPerformingHour: historicalAnalysis.reduce((best, h) =>
          !best || h.performanceScore > best.performanceScore ? h : best, null)
      }
    };
  }

  analyzePerformanceTrends(historicalData) {
    const trends = {
      latencyTrend: this.calculateTrend(historicalData, 'avgLatency'),
      throughputTrend: this.calculateTrend(historicalData, 'queryCount'),
      efficiencyTrend: this.calculateTrend(historicalData, 'avgEfficiency'),
      indexUsageTrend: this.calculateTrend(historicalData, 'indexUsageRate'),

      // Peak usage analysis
      peakHours: this.identifyPeakHours(historicalData),

      // Performance degradation detection
      degradationPeriods: this.identifyDegradationPeriods(historicalData),

      // Collection-specific trends
      collectionTrends: this.analyzeCollectionTrends(historicalData)
    };

    return trends;
  }

  calculateTrend(data, metric) {
    if (data.length < 2) return { direction: 'stable', magnitude: 0 };

    const values = data.map(d => d[metric]).filter(v => v != null);
    const n = values.length;

    if (n < 2) return { direction: 'stable', magnitude: 0 };

    // Simple linear regression for trend calculation
    const xSum = (n * (n + 1)) / 2;
    const ySum = values.reduce((sum, val) => sum + val, 0);
    const xySum = values.reduce((sum, val, i) => sum + val * (i + 1), 0);
    const x2Sum = (n * (n + 1) * (2 * n + 1)) / 6;

    const slope = (n * xySum - xSum * ySum) / (n * x2Sum - xSum * xSum);
    const magnitude = Math.abs(slope);

    // Report a trend only when the per-interval change exceeds 1% of the
    // metric's mean; callers interpret the sign per metric (rising latency is
    // a degradation, rising index usage rate is an improvement)
    const mean = ySum / n;
    const threshold = Math.abs(mean) * 0.01;

    let direction = 'stable';
    if (slope > threshold) direction = 'increasing';
    else if (slope < -threshold) direction = 'decreasing';

    return { direction, magnitude, slope };
  }

  async implementAutomatedOptimization(collectionName, optimizationRules) {
    console.log(`Implementing automated optimization for ${collectionName}...`);

    const collection = this.db.collection(collectionName);
    const optimizationResults = [];

    for (const rule of optimizationRules) {
      try {
        switch (rule.type) {
          case 'AUTO_INDEX_CREATION':
            const indexResult = await this.createOptimizedIndex(collection, rule);
            optimizationResults.push(indexResult);
            break;

          case 'QUERY_REWRITE':
            const rewriteResult = await this.implementQueryRewrite(collection, rule);
            optimizationResults.push(rewriteResult);
            break;

          case 'AGGREGATION_OPTIMIZATION':
            const aggResult = await this.optimizeAggregationPipeline(collection, rule);
            optimizationResults.push(aggResult);
            break;

          default:
            console.warn(`Unknown optimization rule type: ${rule.type}`);
        }
      } catch (error) {
        console.error(`Optimization rule ${rule.type} failed:`, error);
        optimizationResults.push({
          rule: rule.type,
          success: false,
          error: error.message
        });
      }
    }

    // Validate optimization effectiveness
    const validationResults = await this.validateOptimizations(collection, optimizationResults);

    return {
      collection: collectionName,
      optimizationsApplied: optimizationResults,
      validation: validationResults,
      summary: {
        totalRules: optimizationRules.length,
        successful: optimizationResults.filter(r => r.success).length,
        failed: optimizationResults.filter(r => !r.success).length
      }
    };
  }

  async createOptimizedIndex(collection, rule) {
    console.log(`Creating optimized index: ${rule.indexName}`);

    try {
      const indexSpec = rule.indexSpec;
      const indexOptions = rule.indexOptions || {};

      // Request a background build for pre-4.2 servers; MongoDB 4.2+ ignores
      // this option because all index builds use the non-blocking hybrid method
      indexOptions.background = true;

      await collection.createIndex(indexSpec, {
        name: rule.indexName,
        ...indexOptions
      });

      // Test index effectiveness
      const testResult = await this.testIndexEffectiveness(collection, rule);

      return {
        rule: 'AUTO_INDEX_CREATION',
        indexName: rule.indexName,
        indexSpec: indexSpec,
        success: true,
        effectiveness: testResult,
        message: `Index ${rule.indexName} created successfully`
      };

    } catch (error) {
      return {
        rule: 'AUTO_INDEX_CREATION',
        indexName: rule.indexName,
        success: false,
        error: error.message
      };
    }
  }

  async testIndexEffectiveness(collection, rule) {
    if (!rule.testQuery) return { tested: false };

    try {
      // Execute test query with explain
      const explainResult = await collection.find(rule.testQuery).explain('executionStats');

      const effectiveness = {
        tested: true,
        indexUsed: !this.hasCollectionScan(explainResult),
        executionTimeMs: explainResult.executionStats?.executionTimeMillis || 0,
        docsExamined: explainResult.executionStats?.totalDocsExamined || 0,
        docsReturned: explainResult.executionStats?.nReturned || 0,
        efficiency: this.calculateQueryEfficiency(explainResult)
      };

      return effectiveness;

    } catch (error) {
      return {
        tested: false,
        error: error.message
      };
    }
  }

  // Additional helper methods...

  generateIssueSignature(issue) {
    const key = JSON.stringify({
      collection: issue._id.collection,
      queryType: issue._id.queryType,
      queryShape: issue._id.queryShape
    });
    return require('crypto').createHash('md5').update(key).digest('hex');
  }

  calculateSeverity(issue) {
    let score = 0;

    if (issue.avgExecutionTime > 2000) score += 3;
    else if (issue.avgExecutionTime > 1000) score += 2;
    else if (issue.avgExecutionTime > 500) score += 1;

    if (issue.avgEfficiency < 0.05) score += 3;
    else if (issue.avgEfficiency < 0.1) score += 2;
    else if (issue.avgEfficiency < 0.2) score += 1;

    if (issue.indexUsageRate < 0.5) score += 2;
    else if (issue.indexUsageRate < 0.8) score += 1;

    if (score >= 6) return 'CRITICAL';
    else if (score >= 4) return 'HIGH';
    else if (score >= 2) return 'MEDIUM';
    else return 'LOW';
  }
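
  // Example scoring (hypothetical values): avgExecutionTime 1500ms (+2),
  // avgEfficiency 0.08 (+2), indexUsageRate 0.6 (+1) gives a score of 5,
  // which maps to 'HIGH'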

  parseTimeRange(timeRange) {
    const units = {
      'd': 24 * 60 * 60 * 1000,
      'h': 60 * 60 * 1000,
      'm': 60 * 1000
    };

    const match = timeRange.match(/(\d+)([dhm])/);
    if (!match) return 7 * 24 * 60 * 60 * 1000; // Default 7 days

    const [, amount, unit] = match;
    return parseInt(amount) * units[unit];
  }
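
  // Example: parseTimeRange('6h') returns 21,600,000 ms and parseTimeRange('30m')
  // returns 1,800,000 ms; unrecognized formats fall back to the 7-day default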

  async logPerformanceAlert(alert) {
    try {
      await this.db.collection('performance_alerts').insertOne(alert);
    } catch (error) {
      console.warn('Failed to log performance alert:', error.message);
    }
  }
}
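
The analyzer is intended to run continuously against the profiled database. Below is a minimal scheduling sketch, not part of the original code: it assumes AdvancedQueryAnalyzer is available in scope (the class above is not exported) and uses an illustrative one-minute interval and collection list.

const { MongoClient } = require('mongodb');

async function startMonitoring() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const analyzer = new AdvancedQueryAnalyzer(client.db('ecommerce'));

  // Re-run the profiler aggregation once per minute for the hot collections
  setInterval(async () => {
    const issues = await analyzer.implementRealTimePerformanceMonitoring(
      ['users', 'orders', 'products']
    );
    if (issues.length > 0) {
      console.log(`Detected ${issues.length} query performance issue(s)`);
    }
  }, 60 * 1000);
}

startMonitoring().catch(console.error);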

SQL-Style Query Analysis with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB query optimization and explain plan analysis:

-- QueryLeaf query optimization with SQL-familiar EXPLAIN syntax

-- Basic query explain with performance analysis
EXPLAIN (ANALYZE true, BUFFERS true, TIMING true)
SELECT 
  user_id,
  email,
  first_name,
  last_name,
  status,
  created_at
FROM users 
WHERE status = 'active' 
  AND country IN ('US', 'CA', 'UK')
  AND created_at >= CURRENT_DATE - INTERVAL '1 year'
ORDER BY created_at DESC
LIMIT 100;

-- Advanced aggregation explain with optimization recommendations  
EXPLAIN (ANALYZE true, COSTS true, VERBOSE true, FORMAT JSON)
WITH user_activity_summary AS (
  SELECT 
    u.user_id,
    u.email,
    u.first_name,
    u.last_name,
    u.country,
    u.status,
    COUNT(o.order_id) as order_count,
    SUM(o.total_amount) as total_spent,
    AVG(o.total_amount) as avg_order_value,
    MAX(o.created_at) as last_order_date,

    -- Customer value segmentation
    CASE 
      WHEN SUM(o.total_amount) > 1000 THEN 'high_value'
      WHEN SUM(o.total_amount) > 100 THEN 'medium_value'
      ELSE 'low_value'
    END as value_segment,

    -- Activity recency scoring
    CASE 
      WHEN MAX(o.created_at) >= CURRENT_DATE - INTERVAL '30 days' THEN 'recent'
      WHEN MAX(o.created_at) >= CURRENT_DATE - INTERVAL '90 days' THEN 'moderate' 
      WHEN MAX(o.created_at) >= CURRENT_DATE - INTERVAL '1 year' THEN 'old'
      ELSE 'inactive'
    END as activity_segment

  FROM users u
  LEFT JOIN orders o ON u.user_id = o.user_id 
  WHERE u.status = 'active'
    AND u.country IN ('US', 'CA', 'UK', 'AU', 'DE')
    AND u.created_at >= CURRENT_DATE - INTERVAL '2 years'
    AND (o.status = 'completed' OR o.status IS NULL)
  GROUP BY u.user_id, u.email, u.first_name, u.last_name, u.country, u.status
  HAVING COUNT(o.order_id) > 0 OR u.created_at >= CURRENT_DATE - INTERVAL '6 months'
),

customer_insights AS (
  SELECT 
    country,
    value_segment,
    activity_segment,
    COUNT(*) as customer_count,
    AVG(total_spent) as avg_customer_value,
    SUM(order_count) as total_orders,

    -- Geographic performance metrics
    AVG(order_count) as avg_orders_per_customer,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_spent) as median_customer_value,
    STDDEV(total_spent) as customer_value_stddev,

    -- Customer concentration analysis
    COUNT(*)::numeric / SUM(COUNT(*)) OVER (PARTITION BY country) as segment_concentration,

    -- Activity trend indicators
    COUNT(*) FILTER (WHERE activity_segment = 'recent') as recent_active_customers,
    COUNT(*) FILTER (WHERE activity_segment IN ('moderate', 'old')) as declining_customers

  FROM user_activity_summary
  GROUP BY country, value_segment, activity_segment
)

SELECT 
  country,
  value_segment,
  activity_segment,
  customer_count,
  ROUND(avg_customer_value::numeric, 2) as avg_customer_ltv,
  total_orders,
  ROUND(avg_orders_per_customer::numeric, 1) as avg_orders_per_customer,
  ROUND(median_customer_value::numeric, 2) as median_ltv,
  ROUND(segment_concentration::numeric, 4) as market_concentration,

  -- Performance indicators
  CASE 
    WHEN recent_active_customers > declining_customers THEN 'growing'
    WHEN recent_active_customers < declining_customers * 0.5 THEN 'declining'
    ELSE 'stable'
  END as segment_trend,

  -- Business intelligence insights
  CASE
    WHEN value_segment = 'high_value' AND activity_segment = 'recent' THEN 'premium_active'
    WHEN value_segment = 'high_value' AND activity_segment != 'recent' THEN 'at_risk_premium'
    WHEN value_segment != 'low_value' AND activity_segment = 'recent' THEN 'growth_opportunity'
    WHEN activity_segment = 'inactive' THEN 'reactivation_target'
    ELSE 'standard_segment'
  END as strategic_priority,

  -- Ranking within country
  ROW_NUMBER() OVER (
    PARTITION BY country 
    ORDER BY avg_customer_value DESC, customer_count DESC
  ) as country_segment_rank

FROM customer_insights
WHERE customer_count >= 10  -- Filter small segments
ORDER BY country, avg_customer_value DESC, customer_count DESC;

-- QueryLeaf EXPLAIN output with optimization insights:
-- {
--   "queryType": "aggregation",
--   "executionTimeMillis": 245,
--   "totalDocsExamined": 45678,
--   "totalDocsReturned": 1245,
--   "efficiency": 0.027,
--   "indexUsage": {
--     "indexes": ["users_status_country_idx", "orders_user_status_idx"],
--     "effectiveness": 0.78,
--     "missingIndexes": ["users_created_at_idx", "orders_completed_date_idx"]
--   },
--   "stages": [
--     {
--       "stage": "$match",
--       "inputStage": "IXSCAN",
--       "indexName": "users_status_country_idx",
--       "keysExamined": 12456,
--       "docsExamined": 8901,
--       "executionTimeMillis": 45,
--       "optimization": "GOOD - Using compound index efficiently"
--     },
--     {
--       "stage": "$lookup", 
--       "inputStage": "IXSCAN",
--       "indexName": "orders_user_status_idx",
--       "executionTimeMillis": 156,
--       "optimization": "NEEDS_IMPROVEMENT - Consider creating index on (user_id, status, created_at)"
--     },
--     {
--       "stage": "$group",
--       "executionTimeMillis": 34,
--       "memoryUsageMB": 12.3,
--       "spilledToDisk": false,
--       "optimization": "GOOD - Group operation within memory limits"
--     },
--     {
--       "stage": "$sort",
--       "executionTimeMillis": 10,
--       "memoryUsageMB": 2.1,
--       "optimization": "EXCELLENT - Sort using index order"
--     }
--   ],
--   "recommendations": [
--     {
--       "type": "CREATE_INDEX",
--       "priority": "HIGH",
--       "description": "Create compound index to improve JOIN performance",
--       "suggestedIndex": "CREATE INDEX orders_user_status_date_idx ON orders (user_id, status, created_at DESC)",
--       "estimatedImprovement": "60-80% reduction in lookup time"
--     },
--     {
--       "type": "QUERY_RESTRUCTURE",
--       "priority": "MEDIUM", 
--       "description": "Consider splitting complex aggregation into smaller stages",
--       "estimatedImprovement": "20-40% better resource utilization"
--     }
--   ],
--   "performanceGrade": "C+",
--   "bottlenecks": [
--     {
--       "stage": "$lookup",
--       "issue": "Examining too many documents in joined collection",
--       "impact": "63% of total execution time"
--     }
--   ]
-- }

-- Performance monitoring and optimization tracking
WITH query_performance_analysis AS (
  SELECT 
    DATE_TRUNC('hour', execution_timestamp) as hour_bucket,
    collection_name,
    query_type,

    -- Performance metrics
    COUNT(*) as query_count,
    AVG(execution_time_ms) as avg_execution_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as p95_execution_time,
    MAX(execution_time_ms) as max_execution_time,

    -- Resource utilization
    AVG(docs_examined) as avg_docs_examined,
    AVG(docs_returned) as avg_docs_returned,
    AVG(docs_examined::float / GREATEST(docs_returned, 1)) as avg_scan_ratio,

    -- Index effectiveness
    COUNT(*) FILTER (WHERE index_used = true) as queries_with_index,
    AVG(CASE WHEN index_used THEN 1.0 ELSE 0.0 END) as index_hit_rate,
    STRING_AGG(DISTINCT index_name, ', ') as indexes_used,

    -- Error tracking
    COUNT(*) FILTER (WHERE execution_success = false) as failed_queries,
    STRING_AGG(DISTINCT error_type, '; ') FILTER (WHERE error_type IS NOT NULL) as error_types,

    -- Memory and I/O metrics
    AVG(memory_usage_mb) as avg_memory_usage,
    MAX(memory_usage_mb) as peak_memory_usage,
    COUNT(*) FILTER (WHERE spilled_to_disk = true) as queries_spilled_to_disk

  FROM query_execution_log
  WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND collection_name IN ('users', 'orders', 'products', 'analytics')
  GROUP BY DATE_TRUNC('hour', execution_timestamp), collection_name, query_type
),

performance_scoring AS (
  SELECT 
    *,
    -- Performance score calculation (0-100)
    LEAST(100, GREATEST(0,
      -- Execution time score (40% weight)
      (CASE 
        WHEN avg_execution_time <= 50 THEN 40
        WHEN avg_execution_time <= 100 THEN 30
        WHEN avg_execution_time <= 250 THEN 20
        WHEN avg_execution_time <= 500 THEN 10
        ELSE 0
      END) +

      -- Index usage score (35% weight)
      (index_hit_rate * 35) +

      -- Scan efficiency score (25% weight)  
      (CASE
        WHEN avg_scan_ratio <= 1.1 THEN 25
        WHEN avg_scan_ratio <= 2.0 THEN 20
        WHEN avg_scan_ratio <= 5.0 THEN 15
        WHEN avg_scan_ratio <= 10.0 THEN 10
        ELSE 0
      END)
    )) as performance_score,

    -- Performance grade assignment
    CASE 
      WHEN avg_execution_time <= 50 AND index_hit_rate >= 0.9 AND avg_scan_ratio <= 1.5 THEN 'A'
      WHEN avg_execution_time <= 100 AND index_hit_rate >= 0.8 AND avg_scan_ratio <= 3.0 THEN 'B'
      WHEN avg_execution_time <= 250 AND index_hit_rate >= 0.6 AND avg_scan_ratio <= 10.0 THEN 'C'
      WHEN avg_execution_time <= 500 AND index_hit_rate >= 0.4 THEN 'D'
      ELSE 'F'
    END as performance_grade,

    -- Trend analysis (comparing with previous period)
    LAG(avg_execution_time) OVER (
      PARTITION BY collection_name, query_type 
      ORDER BY hour_bucket
    ) as prev_avg_execution_time,

    LAG(index_hit_rate) OVER (
      PARTITION BY collection_name, query_type
      ORDER BY hour_bucket
    ) as prev_index_hit_rate,

    LAG(performance_score) OVER (
      PARTITION BY collection_name, query_type
      ORDER BY hour_bucket  
    ) as prev_performance_score

  FROM query_performance_analysis
),

optimization_recommendations AS (
  SELECT 
    collection_name,
    query_type,
    hour_bucket,
    performance_grade,
    performance_score,

    -- Performance trend indicators
    CASE 
      WHEN prev_performance_score IS NOT NULL THEN
        CASE 
          WHEN performance_score > prev_performance_score + 10 THEN 'IMPROVING'
          WHEN performance_score < prev_performance_score - 10 THEN 'DEGRADING'
          ELSE 'STABLE'
        END
      ELSE 'NEW'
    END as performance_trend,

    -- Specific optimization recommendations
    ARRAY_REMOVE(ARRAY[
      CASE 
        WHEN index_hit_rate < 0.8 THEN 'CREATE_MISSING_INDEXES'
        ELSE NULL
      END,
      CASE
        WHEN avg_scan_ratio > 10 THEN 'IMPROVE_QUERY_SELECTIVITY' 
        ELSE NULL
      END,
      CASE
        WHEN avg_execution_time > 500 THEN 'OPTIMIZE_QUERY_STRUCTURE'
        ELSE NULL
      END,
      CASE
        WHEN failed_queries > query_count * 0.05 THEN 'INVESTIGATE_QUERY_FAILURES'
        ELSE NULL
      END,
      CASE
        WHEN queries_spilled_to_disk > 0 THEN 'REDUCE_MEMORY_USAGE'
        ELSE NULL
      END
    ], NULL) as optimization_actions,

    -- Priority calculation
    CASE
      WHEN performance_grade IN ('D', 'F') AND query_count > 100 THEN 'CRITICAL'
      WHEN performance_grade = 'C' AND query_count > 500 THEN 'HIGH'
      WHEN performance_grade IN ('C', 'D') AND query_count > 50 THEN 'MEDIUM'
      ELSE 'LOW'
    END as optimization_priority,

    -- Detailed metrics for analysis
    query_count,
    avg_execution_time,
    p95_execution_time,
    index_hit_rate,
    avg_scan_ratio,
    failed_queries,
    indexes_used,
    error_types

  FROM performance_scoring
  WHERE query_count >= 5  -- Filter low-volume queries
)

SELECT 
  collection_name,
  query_type,
  performance_grade,
  ROUND(performance_score::numeric, 1) as performance_score,
  performance_trend,
  optimization_priority,

  -- Key performance indicators
  query_count as hourly_query_count,
  ROUND(avg_execution_time::numeric, 2) as avg_latency_ms,
  ROUND(p95_execution_time::numeric, 2) as p95_latency_ms,
  ROUND((index_hit_rate * 100)::numeric, 1) as index_hit_rate_pct,
  ROUND(avg_scan_ratio::numeric, 2) as avg_selectivity_ratio,

  -- Optimization guidance  
  CASE
    WHEN ARRAY_LENGTH(optimization_actions, 1) > 0 THEN
      'Recommended actions: ' || ARRAY_TO_STRING(optimization_actions, ', ')
    ELSE 'Performance within acceptable parameters'
  END as optimization_guidance,

  -- Resource impact assessment
  CASE
    WHEN query_count > 1000 AND performance_grade IN ('D', 'F') THEN 'HIGH_IMPACT'
    WHEN query_count > 500 AND performance_grade = 'C' THEN 'MEDIUM_IMPACT'
    ELSE 'LOW_IMPACT'
  END as resource_impact,

  -- Technical details
  indexes_used,
  error_types,
  hour_bucket as analysis_hour

FROM optimization_recommendations
WHERE optimization_priority IN ('CRITICAL', 'HIGH', 'MEDIUM')
   OR performance_trend = 'DEGRADING'
ORDER BY 
  CASE optimization_priority
    WHEN 'CRITICAL' THEN 1
    WHEN 'HIGH' THEN 2  
    WHEN 'MEDIUM' THEN 3
    ELSE 4
  END,
  performance_score ASC,
  query_count DESC;

-- Real-time query optimization with automated recommendations
CREATE OR REPLACE VIEW query_optimization_dashboard AS
WITH current_performance AS (
  SELECT 
    collection_name,
    query_hash,
    query_pattern,

    -- Recent performance metrics (last hour)
    COUNT(*) as recent_executions,
    AVG(execution_time_ms) as current_avg_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as current_p95_time,
    AVG(docs_examined::float / GREATEST(docs_returned, 1)) as current_scan_ratio,

    -- Index usage analysis
    BOOL_AND(index_used) as all_queries_use_index,
    COUNT(DISTINCT index_name) as unique_indexes_used,
    MODE() WITHIN GROUP (ORDER BY index_name) as most_common_index,

    -- Error rate tracking
    AVG(CASE WHEN execution_success THEN 1.0 ELSE 0.0 END) as success_rate

  FROM query_execution_log
  WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY collection_name, query_hash, query_pattern
  HAVING COUNT(*) >= 5  -- Minimum threshold for analysis
),

historical_baseline AS (
  SELECT 
    collection_name,
    query_hash,

    -- Historical baseline metrics (previous 24 hours, excluding last hour)
    AVG(execution_time_ms) as baseline_avg_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as baseline_p95_time,
    AVG(docs_examined::float / GREATEST(docs_returned, 1)) as baseline_scan_ratio,
    AVG(CASE WHEN execution_success THEN 1.0 ELSE 0.0 END) as baseline_success_rate

  FROM query_execution_log  
  WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '25 hours'
    AND execution_timestamp < CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY collection_name, query_hash
  HAVING COUNT(*) >= 20  -- Sufficient historical data
)

SELECT 
  cp.collection_name,
  cp.query_pattern,
  cp.recent_executions,

  -- Performance comparison
  ROUND(cp.current_avg_time::numeric, 2) as current_avg_latency_ms,
  ROUND(hb.baseline_avg_time::numeric, 2) as baseline_avg_latency_ms,
  ROUND(((cp.current_avg_time - hb.baseline_avg_time) / hb.baseline_avg_time * 100)::numeric, 1) as latency_change_pct,

  -- Performance status classification
  CASE 
    WHEN cp.current_avg_time > hb.baseline_avg_time * 1.5 THEN 'DEGRADED'
    WHEN cp.current_avg_time > hb.baseline_avg_time * 1.2 THEN 'SLOWER'
    WHEN cp.current_avg_time < hb.baseline_avg_time * 0.8 THEN 'IMPROVED'
    ELSE 'STABLE'
  END as performance_status,

  -- Index utilization
  cp.all_queries_use_index,
  cp.unique_indexes_used,
  cp.most_common_index,

  -- Scan efficiency
  ROUND(cp.current_scan_ratio::numeric, 2) as current_scan_ratio,
  ROUND(hb.baseline_scan_ratio::numeric, 2) as baseline_scan_ratio,

  -- Reliability metrics
  ROUND((cp.success_rate * 100)::numeric, 2) as success_rate_pct,
  ROUND((hb.baseline_success_rate * 100)::numeric, 2) as baseline_success_rate_pct,

  -- Automated optimization recommendations
  CASE
    WHEN NOT cp.all_queries_use_index THEN 'CRITICAL: Create missing indexes for consistent performance'
    WHEN cp.current_avg_time > hb.baseline_avg_time * 2 THEN 'HIGH: Investigate severe performance regression'
    WHEN cp.current_scan_ratio > hb.baseline_scan_ratio * 2 THEN 'MEDIUM: Review query selectivity and filters'
    WHEN cp.success_rate < 0.95 THEN 'MEDIUM: Address query reliability issues'
    WHEN cp.current_avg_time > hb.baseline_avg_time * 1.2 THEN 'LOW: Monitor for continued degradation'
    ELSE 'No immediate action required'
  END as recommended_action,

  -- Alert priority
  CASE 
    WHEN NOT cp.all_queries_use_index OR cp.current_avg_time > hb.baseline_avg_time * 2 THEN 'ALERT'
    WHEN cp.current_avg_time > hb.baseline_avg_time * 1.5 OR cp.success_rate < 0.9 THEN 'WARNING'
    ELSE 'INFO'
  END as alert_level

FROM current_performance cp
LEFT JOIN historical_baseline hb ON cp.collection_name = hb.collection_name 
                                 AND cp.query_hash = hb.query_hash
ORDER BY 
  CASE 
    WHEN NOT cp.all_queries_use_index OR cp.current_avg_time > COALESCE(hb.baseline_avg_time * 2, 1000) THEN 1
    WHEN cp.current_avg_time > COALESCE(hb.baseline_avg_time * 1.5, 500) THEN 2
    ELSE 3
  END,
  cp.recent_executions DESC;

-- QueryLeaf provides comprehensive query optimization capabilities:
-- 1. SQL-familiar EXPLAIN syntax with detailed execution plan analysis
-- 2. Advanced performance monitoring with historical trend analysis
-- 3. Automated index recommendations based on query patterns
-- 4. Real-time performance alerts and degradation detection
-- 5. Comprehensive bottleneck identification and optimization guidance
-- 6. Resource usage tracking and capacity planning insights
-- 7. Query efficiency scoring and performance grading systems
-- 8. Integration with MongoDB's native explain plan functionality
-- 9. Batch query analysis for application-wide performance review
-- 10. Production-ready monitoring dashboards and optimization workflows

Best Practices for Query Optimization Implementation

Query Analysis Strategy

Essential principles for effective MongoDB query optimization:

  1. Regular Monitoring: Implement continuous query performance monitoring and alerting (see the profiler sketch after this list)
  2. Index Strategy: Design indexes based on actual query patterns and performance data
  3. Explain Plan Analysis: Use comprehensive explain plan analysis to identify bottlenecks
  4. Historical Tracking: Maintain historical performance data to identify trends and regressions
  5. Automated Optimization: Implement automated optimization recommendations and validation
  6. Production Safety: Test all optimizations thoroughly before applying to production systems
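
As a starting point for item 1, a minimal monitoring sketch using MongoDB's database profiler is shown below. The connection string, database name, and the 100 ms threshold are assumptions, and profiler output field names can vary slightly by server version.

// Minimal monitoring sketch: enable the database profiler for slow operations
// and poll system.profile so slow queries can feed an alerting pipeline
const { MongoClient } = require('mongodb');

async function monitorSlowQueries(uri = 'mongodb://localhost:27017', dbName = 'user_management_platform') {
  const client = new MongoClient(uri);
  await client.connect();
  const db = client.db(dbName);

  // Profiling level 1 records operations slower than slowms (100 ms here)
  await db.command({ profile: 1, slowms: 100 });

  // Read back the most recent slow operations for review or alerting
  const slowOps = await db.collection('system.profile')
    .find({ millis: { $gt: 100 } })
    .sort({ ts: -1 })
    .limit(20)
    .toArray();

  for (const op of slowOps) {
    console.log(`${op.ns} ${op.op} took ${op.millis}ms, examined ${op.docsExamined ?? 'n/a'} docs`);
  }

  await client.close();
  return slowOps;
}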

Performance Tuning Workflow

Optimize MongoDB queries systematically:

  1. Performance Baseline: Establish performance baselines and targets for all critical queries
  2. Bottleneck Identification: Use explain plans to identify specific performance bottlenecks (see the explain sketch after this list)
  3. Optimization Implementation: Apply optimizations following proven patterns and best practices
  4. Validation Testing: Validate optimization effectiveness with comprehensive testing
  5. Monitoring Setup: Implement ongoing monitoring to track optimization impact
  6. Continuous Improvement: Regular review and refinement of optimization strategies
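
For item 2, a minimal baseline capture sketch is shown below. The collection name and filter are illustrative, and the thresholds are assumptions rather than recommended values.

// Capture executionStats for a representative query and flag common bottlenecks
async function captureQueryBaseline(db) {
  const explain = await db.collection('orders')
    .find({ status: 'pending', user_id: 12345 })
    .explain('executionStats');

  const stats = explain.executionStats;
  const baseline = {
    executionTimeMs: stats.executionTimeMillis,
    docsExamined: stats.totalDocsExamined,
    keysExamined: stats.totalKeysExamined,
    docsReturned: stats.nReturned,
    // A high examined-to-returned ratio usually points at a missing or weak index
    scanRatio: stats.totalDocsExamined / Math.max(stats.nReturned, 1),
    usedCollectionScan: JSON.stringify(explain.queryPlanner.winningPlan).includes('COLLSCAN')
  };

  if (baseline.usedCollectionScan || baseline.scanRatio > 10) {
    console.warn('Potential bottleneck detected:', baseline);
  }
  return baseline;
}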

Conclusion

MongoDB's advanced query optimization and explain plan system provides comprehensive tools for identifying performance bottlenecks, analyzing query execution patterns, and implementing effective optimization strategies. The sophisticated explain functionality offers detailed insights that enable both development and production performance tuning with automated recommendations and historical analysis capabilities.

Key MongoDB Query Optimization benefits include:

  • Comprehensive Analysis: Detailed execution plan analysis with performance metrics and bottleneck identification
  • Automated Recommendations: Intelligent optimization suggestions based on query patterns and performance data
  • Real-time Monitoring: Continuous performance monitoring with alerting and trend analysis
  • Production-Ready Tools: Sophisticated analysis tools designed for production database optimization
  • Historical Intelligence: Performance trend analysis and regression detection capabilities
  • Integration-Friendly: Seamless integration with existing monitoring and alerting infrastructure

Whether you're optimizing application queries, managing database performance, or implementing automated optimization workflows, MongoDB's query optimization tools with QueryLeaf's familiar SQL interface provide the foundation for high-performance database operations.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB query optimization while providing SQL-familiar explain plan syntax, performance analysis functions, and optimization recommendations. Advanced query analysis patterns, automated optimization workflows, and comprehensive performance monitoring are seamlessly handled through familiar SQL constructs, making sophisticated database optimization both powerful and accessible to SQL-oriented development teams.

The combination of comprehensive query analysis capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both high-performance queries and familiar database optimization patterns, ensuring your applications achieve optimal performance while remaining maintainable as they scale and evolve.

MongoDB Document Validation and Schema Enforcement: Building Data Integrity with Flexible Schema Design and SQL-Style Constraints

Modern applications require the flexibility of document databases while maintaining data integrity and consistency that traditional relational systems provide through rigid schemas and constraints. MongoDB's document validation system bridges this gap by offering configurable schema enforcement that adapts to evolving business requirements without sacrificing data quality.

MongoDB Document Validation provides rule-based data validation that can enforce structure, data types, value ranges, and business logic constraints at the database level. Unlike rigid relational schemas that require expensive migrations for changes, MongoDB validation rules can evolve incrementally, supporting both strict schema enforcement and flexible document structures within the same database.
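
As a brief preview of that incremental evolution, a validator can be attached to or tightened on an existing collection with a single collMod command. The sketch below assumes a users collection already exists; the collection and field names are illustrative, and the detailed examples later in this article build on the same primitives.

// Minimal sketch of incremental validation: start in 'moderate'/'warn' mode so
// existing documents keep working, then tighten to 'strict'/'error' later
async function addEmailValidation(db) {
  await db.command({
    collMod: 'users',
    validator: {
      $jsonSchema: {
        bsonType: 'object',
        required: ['email'],
        properties: {
          email: {
            bsonType: 'string',
            pattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'
          }
        }
      }
    },
    validationLevel: 'moderate',  // only applies to inserts and updates of already-valid documents
    validationAction: 'warn'      // log violations instead of rejecting writes
  });
}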

The Traditional Schema Rigidity Challenge

Conventional relational database approaches impose inflexible schema constraints that become obstacles to application evolution:

-- Traditional PostgreSQL schema with rigid constraints and migration challenges

-- User table with fixed schema structure
CREATE TABLE users (
  user_id BIGSERIAL PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE,
  username VARCHAR(50) NOT NULL UNIQUE,
  password_hash VARCHAR(255) NOT NULL,
  first_name VARCHAR(100) NOT NULL,
  last_name VARCHAR(100) NOT NULL,
  birth_date DATE,
  phone_number VARCHAR(20),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

  -- Rigid constraints that are difficult to modify
  CONSTRAINT users_email_format CHECK (email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'),
  CONSTRAINT users_phone_format CHECK (phone_number ~* '^\+?[1-9]\d{1,14}$'),
  CONSTRAINT users_birth_date_range CHECK (birth_date >= '1900-01-01' AND birth_date <= CURRENT_DATE),
  CONSTRAINT users_name_length CHECK (LENGTH(first_name) >= 2 AND LENGTH(last_name) >= 2)
);

-- User profile table with limited JSON support
CREATE TABLE user_profiles (
  profile_id BIGSERIAL PRIMARY KEY,
  user_id BIGINT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
  bio TEXT,
  avatar_url VARCHAR(500),
  social_links JSONB,
  preferences JSONB,
  metadata JSONB,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

  -- Limited JSON validation capabilities
  CONSTRAINT profile_bio_length CHECK (LENGTH(bio) <= 1000),
  CONSTRAINT profile_avatar_url_format CHECK (avatar_url ~* '^https?://.*'),
  CONSTRAINT profile_social_links_structure CHECK (
    social_links IS NULL OR jsonb_typeof(social_links) = 'object'
    -- Limiting the number of keys cannot be done here: set-returning functions such as
    -- jsonb_object_keys() are not allowed inside CHECK constraints, so a key-count rule
    -- needs a custom IMMUTABLE helper function or a trigger
  )
);

-- User settings table with enum constraints
CREATE TYPE notification_frequency AS ENUM ('immediate', 'hourly', 'daily', 'weekly', 'never');
CREATE TYPE privacy_level AS ENUM ('public', 'friends', 'private');
CREATE TYPE theme_preference AS ENUM ('light', 'dark', 'auto');

CREATE TABLE user_settings (
  setting_id BIGSERIAL PRIMARY KEY,
  user_id BIGINT NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
  email_notifications notification_frequency DEFAULT 'daily',
  push_notifications notification_frequency DEFAULT 'immediate',
  privacy_level privacy_level DEFAULT 'friends',
  theme theme_preference DEFAULT 'auto',
  language_code VARCHAR(5) DEFAULT 'en-US',
  timezone VARCHAR(50) DEFAULT 'UTC',
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

  -- Rigid enum constraints that require schema changes
  CONSTRAINT settings_language_format CHECK (language_code ~* '^[a-z]{2}(-[A-Z]{2})?$')
  -- Validating timezone against pg_timezone_names cannot be expressed as a CHECK constraint
  -- (subqueries are not allowed in CHECK), so it falls back to triggers or application code
);

-- Complex data insertion with rigid validation
INSERT INTO users (
  email, username, password_hash, first_name, last_name, birth_date, phone_number
) VALUES (
  'john.doe@example.com',
  'johndoe123',
  '$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewdBxJzybKlJNcX.',
  'John',
  'Doe', 
  '1990-05-15',
  '+1-555-123-4567'
);

-- Profile insertion with limited JSON flexibility
INSERT INTO user_profiles (
  user_id, bio, avatar_url, social_links, preferences, metadata
) VALUES (
  1,
  'Software engineer passionate about technology and innovation.',
  'https://example.com/avatars/johndoe.jpg',
  '{"twitter": "@johndoe", "linkedin": "john-doe-dev", "github": "johndoe"}',
  '{"newsletter": true, "marketing_emails": false, "beta_features": true}',
  '{"account_type": "premium", "registration_source": "web", "referral_code": "FRIEND123"}'
);

-- Settings insertion with enum constraints
INSERT INTO user_settings (
  user_id, email_notifications, push_notifications, privacy_level, theme, language_code, timezone
) VALUES (
  1, 'daily', 'immediate', 'friends', 'dark', 'en-US', 'America/New_York'
);

-- Complex query with multiple table joins and JSON operations
WITH user_analysis AS (
  SELECT 
    u.user_id,
    u.email,
    u.username,
    u.first_name,
    u.last_name,
    u.created_at as registration_date,

    -- Profile information with JSON extraction
    up.bio,
    up.avatar_url,
    jsonb_extract_path_text(up.social_links, 'twitter') as twitter_handle,
    jsonb_extract_path_text(up.social_links, 'github') as github_username,

    -- Preferences with type casting
    CAST(jsonb_extract_path_text(up.preferences, 'newsletter') AS BOOLEAN) as newsletter_subscription,
    CAST(jsonb_extract_path_text(up.preferences, 'beta_features') AS BOOLEAN) as beta_participant,

    -- Metadata extraction
    jsonb_extract_path_text(up.metadata, 'account_type') as account_type,
    jsonb_extract_path_text(up.metadata, 'registration_source') as registration_source,

    -- Settings information
    us.email_notifications,
    us.push_notifications,
    us.privacy_level,
    us.theme,
    us.language_code,
    us.timezone,

    -- Calculated fields
    EXTRACT(YEAR FROM AGE(u.birth_date)) as age,
    EXTRACT(DAY FROM (NOW() - u.created_at)) as days_since_registration,

    -- Count social links by enumerating JSON object keys
    (SELECT COUNT(*) FROM jsonb_object_keys(COALESCE(up.social_links, '{}'::jsonb))) as social_link_count,

    -- Complex JSON validation checking
    CASE 
      WHEN up.preferences IS NULL THEN 'incomplete'
      WHEN jsonb_typeof(up.preferences) != 'object' THEN 'invalid'
      WHEN NOT up.preferences ? 'newsletter' THEN 'missing_required'
      ELSE 'valid'
    END as preferences_status

  FROM users u
  LEFT JOIN user_profiles up ON u.user_id = up.user_id
  LEFT JOIN user_settings us ON u.user_id = us.user_id
  WHERE u.created_at >= NOW() - INTERVAL '1 year'
)

SELECT 
  user_id,
  email,
  username,
  first_name || ' ' || last_name as full_name,
  registration_date,
  bio,
  twitter_handle,
  github_username,
  account_type,
  registration_source,
  age,
  days_since_registration,

  -- User categorization based on engagement
  CASE 
    WHEN beta_participant AND newsletter_subscription THEN 'highly_engaged'
    WHEN newsletter_subscription OR social_link_count > 2 THEN 'moderately_engaged' 
    WHEN days_since_registration < 30 THEN 'new_user'
    ELSE 'basic_user'
  END as engagement_level,

  -- Notification preference summary
  CASE 
    WHEN email_notifications = 'immediate' AND push_notifications = 'immediate' THEN 'high_frequency'
    WHEN email_notifications IN ('daily', 'hourly') OR push_notifications IN ('daily', 'hourly') THEN 'moderate_frequency'
    ELSE 'low_frequency'
  END as notification_preference,

  -- Data completeness assessment
  CASE 
    WHEN bio IS NOT NULL AND avatar_url IS NOT NULL AND social_link_count > 0 THEN 'complete'
    WHEN bio IS NOT NULL OR avatar_url IS NOT NULL THEN 'partial'
    ELSE 'minimal'
  END as profile_completeness,

  preferences_status

FROM user_analysis
WHERE preferences_status = 'valid'
ORDER BY 
  CASE engagement_level
    WHEN 'highly_engaged' THEN 1
    WHEN 'moderately_engaged' THEN 2  
    WHEN 'new_user' THEN 3
    ELSE 4
  END,
  days_since_registration DESC;

-- Schema evolution challenges with traditional approaches:
-- 1. Adding new fields requires ALTER TABLE statements with potential downtime
-- 2. Changing data types requires complex migrations and data conversion
-- 3. Enum modifications require dropping and recreating types
-- 4. JSON structure changes are difficult to validate and enforce
-- 5. Cross-table constraints become complex to maintain
-- 6. Schema changes require coordinated application deployments
-- 7. Rollback of schema changes is complex and often impossible
-- 8. Performance impact during large table alterations
-- 9. Limited flexibility for storing varying document structures
-- 10. Complex validation logic requires triggers or application-level enforcement

-- MySQL approach with even more limitations
CREATE TABLE mysql_users (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE,
  username VARCHAR(50) NOT NULL UNIQUE,
  profile_data JSON,
  settings JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  -- Basic JSON validation (limited in older versions)
  CONSTRAINT email_format CHECK (email REGEXP '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$')
);

-- Simple query with limited JSON capabilities
SELECT 
  id,
  email,
  username,
  JSON_EXTRACT(profile_data, '$.first_name') as first_name,
  JSON_EXTRACT(profile_data, '$.last_name') as last_name,
  JSON_EXTRACT(settings, '$.theme') as theme_preference
FROM mysql_users
WHERE JSON_EXTRACT(profile_data, '$.account_type') = 'premium';

-- MySQL limitations:
-- - Very limited JSON validation and constraint capabilities
-- - Basic JSON functions with poor performance on large datasets
-- - No sophisticated document structure validation
-- - Minimal support for nested object validation
-- - Limited flexibility for evolving JSON schemas
-- - Poor indexing support for JSON fields
-- - Basic constraint checking without complex business logic

MongoDB Document Validation provides flexible, powerful schema enforcement:

// MongoDB Document Validation - flexible schema enforcement with powerful validation rules
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('user_management_platform');

// Comprehensive document validation and schema management system
class MongoDBValidationManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
    this.validationRules = new Map();
    this.migrationHistory = [];
  }

  async initializeCollectionsWithValidation() {
    console.log('Initializing collections with comprehensive document validation...');

    // Create users collection with sophisticated validation rules
    try {
      await this.db.createCollection('users', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['email', 'username', 'password_hash', 'profile', 'created_at'],
            additionalProperties: false,
            properties: {
              _id: {
                bsonType: 'objectId'
              },

              // Core identity fields with validation
              email: {
                bsonType: 'string',
                pattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$',
                description: 'Valid email address required'
              },

              username: {
                bsonType: 'string',
                minLength: 3,
                maxLength: 30,
                pattern: '^[a-zA-Z0-9_-]+$',
                description: 'Username must be 3-30 characters, alphanumeric with underscore/dash'
              },

              password_hash: {
                bsonType: 'string',
                minLength: 60,
                maxLength: 60,
                description: 'BCrypt hash must be exactly 60 characters'
              },

              // Nested profile object with detailed validation
              profile: {
                bsonType: 'object',
                required: ['first_name', 'last_name'],
                additionalProperties: true,
                properties: {
                  first_name: {
                    bsonType: 'string',
                    minLength: 1,
                    maxLength: 100,
                    description: 'First name is required'
                  },

                  last_name: {
                    bsonType: 'string',
                    minLength: 1,
                    maxLength: 100,
                    description: 'Last name is required'
                  },

                  middle_name: {
                    bsonType: ['string', 'null'],
                    maxLength: 100
                  },

                  birth_date: {
                    bsonType: ['date', 'null'],
                    description: 'Birth date must be a valid date when provided'
                  },

                  phone_number: {
                    bsonType: ['string', 'null'],
                    pattern: '^\\+?[1-9]\\d{1,14}$',
                    description: 'Valid international phone number format'
                  },

                  bio: {
                    bsonType: ['string', 'null'],
                    maxLength: 1000,
                    description: 'Bio must not exceed 1000 characters'
                  },

                  avatar_url: {
                    bsonType: ['string', 'null'],
                    pattern: '^https?://.*\\.(jpg|jpeg|png|gif|webp)$',
                    description: 'Avatar must be a valid image URL'
                  },

                  // Social links with nested validation
                  social_links: {
                    bsonType: ['object', 'null'],
                    additionalProperties: false,
                    properties: {
                      twitter: {
                        bsonType: 'string',
                        pattern: '^@?[a-zA-Z0-9_]{1,15}$'
                      },
                      linkedin: {
                        bsonType: 'string',
                        pattern: '^[a-zA-Z0-9-]{3,100}$'
                      },
                      github: {
                        bsonType: 'string',
                        pattern: '^[a-zA-Z0-9-]{1,39}$'
                      },
                      website: {
                        bsonType: 'string',
                        pattern: '^https?://[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}.*$'
                      },
                      instagram: {
                        bsonType: 'string',
                        pattern: '^@?[a-zA-Z0-9_.]{1,30}$'
                      }
                    }
                  },

                  // Address with geolocation support
                  address: {
                    bsonType: ['object', 'null'],
                    properties: {
                      street: { bsonType: 'string', maxLength: 200 },
                      city: { bsonType: 'string', maxLength: 100 },
                      state: { bsonType: 'string', maxLength: 100 },
                      postal_code: { bsonType: 'string', maxLength: 20 },
                      country: { bsonType: 'string', maxLength: 100 },
                      coordinates: {
                        bsonType: 'object',
                        properties: {
                          type: { enum: ['Point'] },
                          coordinates: {
                            bsonType: 'array',
                            minItems: 2,
                            maxItems: 2,
                            items: { bsonType: 'number' }
                          }
                        }
                      }
                    }
                  }
                }
              },

              // User preferences with detailed validation
              preferences: {
                bsonType: 'object',
                additionalProperties: true,
                properties: {
                  notifications: {
                    bsonType: 'object',
                    properties: {
                      email: {
                        bsonType: 'object',
                        properties: {
                          marketing: { bsonType: 'bool' },
                          security: { bsonType: 'bool' },
                          product_updates: { bsonType: 'bool' },
                          frequency: { enum: ['immediate', 'daily', 'weekly', 'never'] }
                        }
                      },
                      push: {
                        bsonType: 'object',
                        properties: {
                          enabled: { bsonType: 'bool' },
                          sound: { bsonType: 'bool' },
                          vibration: { bsonType: 'bool' },
                          frequency: { enum: ['immediate', 'hourly', 'daily', 'never'] }
                        }
                      }
                    }
                  },

                  privacy: {
                    bsonType: 'object',
                    properties: {
                      profile_visibility: { enum: ['public', 'friends', 'private'] },
                      search_visibility: { bsonType: 'bool' },
                      activity_status: { bsonType: 'bool' },
                      data_collection: { bsonType: 'bool' }
                    }
                  },

                  interface: {
                    bsonType: 'object',
                    properties: {
                      theme: { enum: ['light', 'dark', 'auto'] },
                      language: {
                        bsonType: 'string',
                        pattern: '^[a-z]{2}(-[A-Z]{2})?$'
                      },
                      timezone: {
                        bsonType: 'string',
                        description: 'Valid IANA timezone'
                      },
                      date_format: { enum: ['MM/DD/YYYY', 'DD/MM/YYYY', 'YYYY-MM-DD'] },
                      time_format: { enum: ['12h', '24h'] }
                    }
                  }
                }
              },

              // Account status and metadata
              account: {
                bsonType: 'object',
                required: ['status', 'type', 'verification'],
                properties: {
                  status: { enum: ['active', 'inactive', 'suspended', 'pending'] },
                  type: { enum: ['free', 'premium', 'enterprise', 'admin'] },
                  subscription_expires_at: { bsonType: ['date', 'null'] },

                  verification: {
                    bsonType: 'object',
                    properties: {
                      email_verified: { bsonType: 'bool' },
                      email_verified_at: { bsonType: ['date', 'null'] },
                      phone_verified: { bsonType: 'bool' },
                      phone_verified_at: { bsonType: ['date', 'null'] },
                      identity_verified: { bsonType: 'bool' },
                      identity_verified_at: { bsonType: ['date', 'null'] },
                      verification_level: { enum: ['none', 'email', 'phone', 'identity', 'full'] }
                    }
                  },

                  security: {
                    bsonType: 'object',
                    properties: {
                      two_factor_enabled: { bsonType: 'bool' },
                      two_factor_method: { enum: ['none', 'sms', 'app', 'email'] },
                      password_changed_at: { bsonType: 'date' },
                      last_password_reset: { bsonType: ['date', 'null'] },
                      failed_login_attempts: { bsonType: 'int', minimum: 0, maximum: 10 },
                      account_locked_until: { bsonType: ['date', 'null'] }
                    }
                  }
                }
              },

              // Activity tracking
              activity: {
                bsonType: 'object',
                properties: {
                  last_login_at: { bsonType: ['date', 'null'] },
                  last_activity_at: { bsonType: ['date', 'null'] },
                  login_count: { bsonType: 'int', minimum: 0 },
                  session_count: { bsonType: 'int', minimum: 0 },
                  ip_address: {
                    bsonType: ['string', 'null'],
                    pattern: '^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$|^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$'
                  },
                  user_agent: { bsonType: ['string', 'null'], maxLength: 500 }
                }
              },

              // Flexible metadata for application-specific data
              metadata: {
                bsonType: ['object', 'null'],
                additionalProperties: true,
                properties: {
                  registration_source: {
                    enum: ['web', 'mobile_app', 'api', 'admin', 'import', 'social_oauth']
                  },
                  referral_code: {
                    bsonType: ['string', 'null'],
                    pattern: '^[A-Z0-9]{6,12}$'
                  },
                  campaign_id: { bsonType: ['string', 'null'] },
                  utm_source: { bsonType: ['string', 'null'] },
                  utm_medium: { bsonType: ['string', 'null'] },
                  utm_campaign: { bsonType: ['string', 'null'] },
                  affiliate_id: { bsonType: ['string', 'null'] }
                }
              },

              // Audit timestamps
              created_at: {
                bsonType: 'date',
                description: 'Account creation timestamp required'
              },

              updated_at: {
                bsonType: 'date',
                description: 'Last update timestamp'
              },

              deleted_at: {
                bsonType: ['date', 'null'],
                description: 'Soft delete timestamp'
              }
            }
          }
        },
        validationLevel: 'strict',
        validationAction: 'error'
      });

      console.log('Created users collection with comprehensive validation');
      this.collections.set('users', this.db.collection('users'));

    } catch (error) {
      if (error.code !== 48) { // Collection already exists
        throw error;
      }
      console.log('Users collection already exists');
      this.collections.set('users', this.db.collection('users'));
    }

    // Create additional collections with validation
    await this.createSessionsCollection();
    await this.createAuditLogCollection();
    await this.createNotificationsCollection();

    // Create indexes optimized for validation and queries
    await this.createOptimizedIndexes();

    return Array.from(this.collections.keys());
  }

  async createSessionsCollection() {
    try {
      await this.db.createCollection('user_sessions', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['user_id', 'session_token', 'created_at', 'expires_at', 'is_active'],
            properties: {
              _id: { bsonType: 'objectId' },

              user_id: {
                bsonType: 'objectId',
                description: 'Reference to user document'
              },

              session_token: {
                bsonType: 'string',
                minLength: 32,
                maxLength: 128,
                description: 'Secure session token'
              },

              refresh_token: {
                bsonType: ['string', 'null'],
                minLength: 32,
                maxLength: 128
              },

              device_info: {
                bsonType: 'object',
                properties: {
                  device_type: { enum: ['desktop', 'mobile', 'tablet', 'unknown'] },
                  browser: { bsonType: 'string', maxLength: 100 },
                  os: { bsonType: 'string', maxLength: 100 },
                  ip_address: { bsonType: 'string' },
                  user_agent: { bsonType: 'string', maxLength: 500 }
                }
              },

              location: {
                bsonType: ['object', 'null'],
                properties: {
                  country: { bsonType: 'string', maxLength: 100 },
                  region: { bsonType: 'string', maxLength: 100 },
                  city: { bsonType: 'string', maxLength: 100 },
                  coordinates: {
                    bsonType: 'array',
                    minItems: 2,
                    maxItems: 2,
                    items: { bsonType: 'number' }
                  }
                }
              },

              is_active: { bsonType: 'bool' },

              created_at: { bsonType: 'date' },
              updated_at: { bsonType: 'date' },
              expires_at: { bsonType: 'date' },
              last_activity_at: { bsonType: ['date', 'null'] }
            }
          }
        },
        validationLevel: 'strict'
      });

      // Create TTL index for automatic session cleanup
      await this.db.collection('user_sessions').createIndex(
        { expires_at: 1 }, 
        { expireAfterSeconds: 0 }
      );

      this.collections.set('user_sessions', this.db.collection('user_sessions'));
      console.log('Created user_sessions collection with validation');

    } catch (error) {
      if (error.code !== 48) throw error;
      this.collections.set('user_sessions', this.db.collection('user_sessions'));
    }
  }

  async createAuditLogCollection() {
    try {
      await this.db.createCollection('audit_log', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['user_id', 'action', 'resource_type', 'timestamp'],
            properties: {
              _id: { bsonType: 'objectId' },

              user_id: {
                bsonType: ['objectId', 'null'],
                description: 'User who performed the action'
              },

              action: {
                enum: [
                  'create', 'read', 'update', 'delete',
                  'login', 'logout', 'password_change', 'email_change',
                  'profile_update', 'settings_change', 'verification',
                  'admin_action', 'api_access', 'export_data'
                ],
                description: 'Type of action performed'
              },

              resource_type: {
                bsonType: 'string',
                maxLength: 100,
                description: 'Type of resource affected'
              },

              resource_id: {
                bsonType: ['string', 'objectId', 'null'],
                description: 'ID of the affected resource'
              },

              details: {
                bsonType: ['object', 'null'],
                additionalProperties: true,
                description: 'Additional action details'
              },

              changes: {
                bsonType: ['object', 'null'],
                properties: {
                  before: { bsonType: ['object', 'null'] },
                  after: { bsonType: ['object', 'null'] },
                  fields_changed: {
                    bsonType: 'array',
                    items: { bsonType: 'string' }
                  }
                }
              },

              request_info: {
                bsonType: ['object', 'null'],
                properties: {
                  ip_address: { bsonType: 'string' },
                  user_agent: { bsonType: 'string', maxLength: 500 },
                  method: { enum: ['GET', 'POST', 'PUT', 'PATCH', 'DELETE'] },
                  endpoint: { bsonType: 'string', maxLength: 200 },
                  session_id: { bsonType: ['string', 'null'] }
                }
              },

              result: {
                bsonType: 'object',
                properties: {
                  success: { bsonType: 'bool' },
                  error_message: { bsonType: ['string', 'null'] },
                  error_code: { bsonType: ['string', 'null'] },
                  duration_ms: { bsonType: 'int', minimum: 0 }
                }
              },

              timestamp: { bsonType: 'date' }
            }
          }
        }
      });

      this.collections.set('audit_log', this.db.collection('audit_log'));
      console.log('Created audit_log collection with validation');

    } catch (error) {
      if (error.code !== 48) throw error;
      this.collections.set('audit_log', this.db.collection('audit_log'));
    }
  }

  async createNotificationsCollection() {
    try {
      await this.db.createCollection('notifications', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['user_id', 'type', 'title', 'content', 'status', 'created_at'],
            properties: {
              _id: { bsonType: 'objectId' },

              user_id: {
                bsonType: 'objectId',
                description: 'Target user for notification'
              },

              type: {
                enum: [
                  'security_alert', 'account_update', 'welcome', 'verification',
                  'password_reset', 'login_alert', 'subscription', 'feature_announcement',
                  'maintenance', 'privacy_update', 'marketing', 'system'
                ],
                description: 'Notification category'
              },

              priority: {
                enum: ['low', 'normal', 'high', 'urgent'],
                description: 'Notification priority level'
              },

              title: {
                bsonType: 'string',
                minLength: 1,
                maxLength: 200,
                description: 'Notification title'
              },

              content: {
                bsonType: 'string',
                minLength: 1,
                maxLength: 2000,
                description: 'Notification message content'
              },

              action: {
                bsonType: ['object', 'null'],
                properties: {
                  label: { bsonType: 'string', maxLength: 50 },
                  url: { bsonType: 'string', maxLength: 500 },
                  action_type: { enum: ['link', 'button', 'dismiss', 'confirm'] }
                }
              },

              channels: {
                bsonType: 'array',
                items: {
                  enum: ['email', 'push', 'in_app', 'sms', 'webhook']
                },
                description: 'Delivery channels for notification'
              },

              delivery: {
                bsonType: 'object',
                properties: {
                  email: {
                    bsonType: ['object', 'null'],
                    properties: {
                      sent_at: { bsonType: ['date', 'null'] },
                      delivered_at: { bsonType: ['date', 'null'] },
                      opened_at: { bsonType: ['date', 'null'] },
                      clicked_at: { bsonType: ['date', 'null'] },
                      bounced: { bsonType: 'bool' },
                      error_message: { bsonType: ['string', 'null'] }
                    }
                  },
                  push: {
                    bsonType: ['object', 'null'],
                    properties: {
                      sent_at: { bsonType: ['date', 'null'] },
                      delivered_at: { bsonType: ['date', 'null'] },
                      clicked_at: { bsonType: ['date', 'null'] },
                      error_message: { bsonType: ['string', 'null'] }
                    }
                  },
                  in_app: {
                    bsonType: ['object', 'null'],
                    properties: {
                      shown_at: { bsonType: ['date', 'null'] },
                      clicked_at: { bsonType: ['date', 'null'] },
                      dismissed_at: { bsonType: ['date', 'null'] }
                    }
                  }
                }
              },

              status: {
                enum: ['pending', 'sent', 'delivered', 'read', 'dismissed', 'failed'],
                description: 'Current notification status'
              },

              metadata: {
                bsonType: ['object', 'null'],
                additionalProperties: true,
                description: 'Additional notification metadata'
              },

              expires_at: {
                bsonType: ['date', 'null'],
                description: 'Notification expiration date'
              },

              created_at: { bsonType: 'date' },
              updated_at: { bsonType: 'date' }
            }
          }
        }
      });

      this.collections.set('notifications', this.db.collection('notifications'));
      console.log('Created notifications collection with validation');

    } catch (error) {
      if (error.code !== 48) throw error;
      this.collections.set('notifications', this.db.collection('notifications'));
    }
  }

  async createOptimizedIndexes() {
    console.log('Creating optimized indexes for validated collections...');

    const users = this.collections.get('users');
    const sessions = this.collections.get('user_sessions');
    const audit = this.collections.get('audit_log');
    const notifications = this.collections.get('notifications');

    // User collection indexes
    const userIndexes = [
      { email: 1 },
      { username: 1 },
      { 'account.status': 1 },
      { 'account.type': 1 },
      { created_at: -1 },
      { 'activity.last_login_at': -1 },
      { 'profile.phone_number': 1 },
      { 'account.verification.email_verified': 1 },
      { 'metadata.registration_source': 1 },

      // Compound indexes for common queries
      { 'account.status': 1, 'account.type': 1 },
      { 'account.type': 1, created_at: -1 },
      { 'account.verification.verification_level': 1, created_at: -1 }
    ];

    for (const indexSpec of userIndexes) {
      try {
        await users.createIndex(indexSpec, { background: true });
      } catch (error) {
        console.warn('Index creation warning:', error.message);
      }
    }

    // Session collection indexes
    await sessions.createIndex({ user_id: 1, is_active: 1 }, { background: true });
    await sessions.createIndex({ session_token: 1 }, { unique: true, background: true });
    await sessions.createIndex({ created_at: -1 }, { background: true });

    // Audit log indexes
    await audit.createIndex({ user_id: 1, timestamp: -1 }, { background: true });
    await audit.createIndex({ action: 1, timestamp: -1 }, { background: true });
    await audit.createIndex({ resource_type: 1, resource_id: 1 }, { background: true });

    // Notification indexes
    await notifications.createIndex({ user_id: 1, status: 1 }, { background: true });
    await notifications.createIndex({ type: 1, created_at: -1 }, { background: true });
    await notifications.createIndex({ expires_at: 1 }, { expireAfterSeconds: 0 });

    console.log('Optimized indexes created successfully');
  }

  async insertValidatedUserData(userData) {
    console.log('Inserting user data with comprehensive validation...');

    const users = this.collections.get('users');
    const currentTime = new Date();

    // Prepare validated user document
    const validatedUser = {
      email: userData.email,
      username: userData.username,
      password_hash: userData.password_hash,

      profile: {
        first_name: userData.profile.first_name,
        last_name: userData.profile.last_name,
        middle_name: userData.profile.middle_name || null,
        birth_date: userData.profile.birth_date ? new Date(userData.profile.birth_date) : null,
        phone_number: userData.profile.phone_number || null,
        bio: userData.profile.bio || null,
        avatar_url: userData.profile.avatar_url || null,

        social_links: userData.profile.social_links || null,

        address: userData.profile.address ? {
          street: userData.profile.address.street,
          city: userData.profile.address.city,
          state: userData.profile.address.state,
          postal_code: userData.profile.address.postal_code,
          country: userData.profile.address.country,
          coordinates: userData.profile.address.coordinates ? {
            type: 'Point',
            coordinates: userData.profile.address.coordinates
          } : null
        } : null
      },

      preferences: {
        notifications: {
          email: {
            marketing: userData.preferences?.notifications?.email?.marketing ?? false,
            security: userData.preferences?.notifications?.email?.security ?? true,
            product_updates: userData.preferences?.notifications?.email?.product_updates ?? true,
            frequency: userData.preferences?.notifications?.email?.frequency || 'daily'
          },
          push: {
            enabled: userData.preferences?.notifications?.push?.enabled ?? true,
            sound: userData.preferences?.notifications?.push?.sound ?? true,
            vibration: userData.preferences?.notifications?.push?.vibration ?? true,
            frequency: userData.preferences?.notifications?.push?.frequency || 'immediate'
          }
        },

        privacy: {
          profile_visibility: userData.preferences?.privacy?.profile_visibility || 'friends',
          search_visibility: userData.preferences?.privacy?.search_visibility ?? true,
          activity_status: userData.preferences?.privacy?.activity_status ?? true,
          data_collection: userData.preferences?.privacy?.data_collection ?? true
        },

        interface: {
          theme: userData.preferences?.interface?.theme || 'auto',
          language: userData.preferences?.interface?.language || 'en-US',
          timezone: userData.preferences?.interface?.timezone || 'UTC',
          date_format: userData.preferences?.interface?.date_format || 'MM/DD/YYYY',
          time_format: userData.preferences?.interface?.time_format || '12h'
        }
      },

      account: {
        status: userData.account?.status || 'active',
        type: userData.account?.type || 'free',
        subscription_expires_at: userData.account?.subscription_expires_at ? 
          new Date(userData.account.subscription_expires_at) : null,

        verification: {
          email_verified: false,
          email_verified_at: null,
          phone_verified: false,
          phone_verified_at: null,
          identity_verified: false,
          identity_verified_at: null,
          verification_level: 'none'
        },

        security: {
          two_factor_enabled: false,
          two_factor_method: 'none',
          password_changed_at: currentTime,
          last_password_reset: null,
          failed_login_attempts: 0,
          account_locked_until: null
        }
      },

      activity: {
        last_login_at: null,
        last_activity_at: null,
        login_count: 0,
        session_count: 0,
        ip_address: userData.activity?.ip_address || null,
        user_agent: userData.activity?.user_agent || null
      },

      metadata: userData.metadata || null,

      created_at: currentTime,
      updated_at: currentTime,
      deleted_at: null
    };

    try {
      const result = await users.insertOne(validatedUser);

      // Log successful user creation
      await this.logAuditEvent({
        user_id: result.insertedId,
        action: 'create',
        resource_type: 'user',
        resource_id: result.insertedId.toString(),
        details: {
          username: validatedUser.username,
          email: validatedUser.email,
          account_type: validatedUser.account.type
        },
        request_info: {
          ip_address: validatedUser.activity.ip_address,
          user_agent: validatedUser.activity.user_agent
        },
        result: {
          success: true,
          duration_ms: 0 // Would be calculated in real implementation
        },
        timestamp: currentTime
      });

      console.log(`User created successfully with ID: ${result.insertedId}`);
      return result;

    } catch (validationError) {
      console.error('User validation failed:', validationError);

      // Log failed user creation attempt
      await this.logAuditEvent({
        user_id: null,
        action: 'create',
        resource_type: 'user',
        details: {
          attempted_email: userData.email,
          attempted_username: userData.username
        },
        result: {
          success: false,
          error_message: validationError.message,
          error_code: validationError.code?.toString()
        },
        timestamp: currentTime
      });

      throw validationError;
    }
  }

  async logAuditEvent(eventData) {
    const auditLog = this.collections.get('audit_log');

    try {
      await auditLog.insertOne(eventData);
    } catch (error) {
      console.warn('Failed to log audit event:', error.message);
    }
  }

  async performValidationMigration(collectionName, newValidationRules, options = {}) {
    console.log(`Performing validation migration for collection: ${collectionName}`);

    const {
      validationLevel = 'strict',
      validationAction = 'error',
      dryRun = false,
      batchSize = 1000
    } = options;

    const collection = this.db.collection(collectionName);

    if (dryRun) {
      // Test validation rules against existing documents
      console.log('Running dry run validation test...');

      const validationErrors = [];
      let processedCount = 0;

      const cursor = collection.find({}).limit(batchSize);

      for await (const document of cursor) {
        try {
          // Test document against new validation rules (simplified)
          const testResult = await this.testDocumentValidation(document, newValidationRules);

          if (!testResult.valid) {
            validationErrors.push({
              documentId: document._id,
              errors: testResult.errors
            });
          }

          processedCount++;

        } catch (error) {
          validationErrors.push({
            documentId: document._id,
            errors: [error.message]
          });
        }
      }

      console.log(`Dry run completed: ${processedCount} documents tested, ${validationErrors.length} validation errors found`);

      return {
        dryRun: true,
        documentsProcessed: processedCount,
        validationErrors: validationErrors,
        migrationFeasible: validationErrors.length === 0
      };
    }

    // Apply new validation rules
    try {
      await this.db.runCommand({
        collMod: collectionName,
        validator: newValidationRules,
        validationLevel: validationLevel,
        validationAction: validationAction
      });

      // Record migration in history
      this.migrationHistory.push({
        collection: collectionName,
        timestamp: new Date(),
        validationRules: newValidationRules,
        validationLevel: validationLevel,
        validationAction: validationAction,
        success: true
      });

      console.log(`Validation migration completed successfully for ${collectionName}`);

      return {
        success: true,
        collection: collectionName,
        timestamp: new Date(),
        validationLevel: validationLevel,
        validationAction: validationAction
      };

    } catch (error) {
      console.error('Validation migration failed:', error);

      this.migrationHistory.push({
        collection: collectionName,
        timestamp: new Date(),
        success: false,
        error: error.message
      });

      throw error;
    }
  }

  async testDocumentValidation(document, validationRules) {
    // Simplified validation testing (in real implementation, would use MongoDB's validator)
    try {
      // This would use MongoDB's internal validation logic
      return { valid: true, errors: [] };
    } catch (error) {
      return { valid: false, errors: [error.message] };
    }
  }

  async generateValidationReport() {
    console.log('Generating comprehensive validation report...');

    const report = {
      collections: new Map(),
      summary: {
        totalCollections: 0,
        validatedCollections: 0,
        totalDocuments: 0,
        validationCoverage: 0
      },
      recommendations: []
    };

    for (const [collectionName, collection] of this.collections) {
      console.log(`Analyzing validation for collection: ${collectionName}`);

      try {
        // Get collection info including validation rules
        const collectionInfo = await this.db.runCommand({ listCollections: 1, filter: { name: collectionName } });
        const stats = await collection.stats();

        const collectionData = {
          name: collectionName,
          documentCount: stats.count,
          avgDocumentSize: stats.avgObjSize,
          indexCount: stats.nindexes,
          hasValidation: false,
          validationLevel: null,
          validationAction: null,
          validationRules: null
        };

        // Check if validation is configured
        if (collectionInfo.cursor.firstBatch[0]?.options?.validator) {
          collectionData.hasValidation = true;
          collectionData.validationLevel = collectionInfo.cursor.firstBatch[0].options.validationLevel || 'strict';
          collectionData.validationAction = collectionInfo.cursor.firstBatch[0].options.validationAction || 'error';
          collectionData.validationRules = collectionInfo.cursor.firstBatch[0].options.validator;
        }

        report.collections.set(collectionName, collectionData);
        report.summary.totalCollections++;
        report.summary.totalDocuments += stats.count;

        if (collectionData.hasValidation) {
          report.summary.validatedCollections++;
        }

        // Generate recommendations
        if (!collectionData.hasValidation && stats.count > 1000) {
          report.recommendations.push(`Consider adding validation rules to ${collectionName} (${stats.count} documents)`);
        }

        if (collectionData.hasValidation && collectionData.validationLevel === 'moderate') {
          report.recommendations.push(`Consider upgrading ${collectionName} to strict validation for better data integrity`);
        }

      } catch (error) {
        console.warn(`Could not analyze collection ${collectionName}:`, error.message);
      }
    }

    report.summary.validationCoverage = report.summary.totalCollections > 0 ? 
      (report.summary.validatedCollections / report.summary.totalCollections * 100) : 0;

    console.log('Validation report generated successfully');
    return report;
  }
}

// Benefits of MongoDB Document Validation:
// - Flexible schema evolution without complex migrations or downtime
// - Rich validation rules supporting nested objects, arrays, and complex business logic
// - Configurable validation levels (strict, moderate, off) for different environments
// - JSON Schema standard compliance with MongoDB-specific extensions
// - Integration with MongoDB's native indexing and query optimization
// - Support for custom validation logic and conditional constraints
// - Gradual validation enforcement for existing data migration scenarios
// - Real-time validation feedback during development and testing
// - Audit trail capabilities for tracking schema changes and validation events
// - Performance optimizations that leverage MongoDB's document-oriented architecture

module.exports = {
  MongoDBValidationManager
};
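
Tying the pieces together, usage of the manager above might look like the following sketch. The connection string and the sample payload are assumptions; other methods such as generateValidationReport() can be invoked the same way once the collections are initialized.

// Example wiring for the MongoDBValidationManager defined above
async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  try {
    const manager = new MongoDBValidationManager(client.db('user_management_platform'));
    await manager.initializeCollectionsWithValidation();

    // Insert a document that satisfies the users collection validator
    await manager.insertValidatedUserData({
      email: 'jane.doe@example.com',
      username: 'janedoe',
      password_hash: '$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewdBxJzybKlJNcX.',
      profile: { first_name: 'Jane', last_name: 'Doe' }
    });
  } finally {
    await client.close();
  }
}

main().catch(console.error);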

Understanding MongoDB Document Validation Architecture

Advanced Validation Patterns and Schema Evolution

Implement sophisticated validation strategies for production applications with evolving requirements:

// Advanced document validation patterns and schema evolution strategies
class AdvancedValidationManager {
  constructor(db) {
    this.db = db;
    this.schemaVersions = new Map();
    this.validationProfiles = new Map();
    this.migrationQueue = [];
  }

  async implementConditionalValidation(collectionName, validationProfiles) {
    console.log(`Implementing conditional validation for ${collectionName}`);

    // Create validation rules that adapt based on document type or version
    const conditionalValidator = {
      $or: validationProfiles.map(profile => ({
        $and: [
          profile.condition,
          { $jsonSchema: profile.schema }
        ]
      }))
    };

    await this.db.runCommand({
      collMod: collectionName,
      validator: conditionalValidator,
      validationLevel: 'strict'
    });

    this.validationProfiles.set(collectionName, validationProfiles);
    return conditionalValidator;
  }

  async implementVersionedValidation(collectionName, versions) {
    console.log(`Setting up versioned validation for ${collectionName}`);

    const versionedValidator = {
      $or: versions.map(version => ({
        $and: [
          { schema_version: { $eq: version.version } },
          { $jsonSchema: version.schema }
        ]
      }))
    };

    // Store version history
    this.schemaVersions.set(collectionName, {
      current: Math.max(...versions.map(v => v.version)),
      versions: versions,
      created_at: new Date()
    });

    await this.db.runCommand({
      collMod: collectionName,
      validator: versionedValidator,
      validationLevel: 'strict'
    });

    return versionedValidator;
  }

  async performGradualMigration(collectionName, targetValidation, options = {}) {
    console.log(`Starting gradual migration for ${collectionName}`);

    const {
      batchSize = 1000,
      delayMs = 100,
      validationMode = 'warn_then_error'
    } = options;

    // Phase 1: Warning mode
    if (validationMode === 'warn_then_error') {
      console.log('Phase 1: Enabling validation in warning mode');
      await this.db.runCommand({
        collMod: collectionName,
        validator: targetValidation,
        validationLevel: 'moderate',
        validationAction: 'warn'
      });

      // Allow time for monitoring and fixing validation warnings
      console.log('Monitoring validation warnings for 24 hours...');
      // In production, this would be a longer monitoring period
    }

    // Phase 2: Strict enforcement
    console.log('Phase 2: Enabling strict validation');
    await this.db.runCommand({
      collMod: collectionName,
      validator: targetValidation,
      validationLevel: 'strict',
      validationAction: 'error'
    });

    console.log('Gradual migration completed successfully');
    return { success: true, phases: 2 };
  }

  generateBusinessLogicValidation(rules) {
    // Convert business rules into MongoDB validation expressions
    const validationExpressions = [];

    for (const rule of rules) {
      switch (rule.type) {
        case 'date_range':
          validationExpressions.push({
            [rule.field]: {
              $gte: new Date(rule.min),
              $lte: new Date(rule.max)
            }
          });
          break;

        case 'conditional_required':
          validationExpressions.push({
            $or: [
              { [rule.condition.field]: { $ne: rule.condition.value } },
              { [rule.requiredField]: { $exists: true, $ne: null } }
            ]
          });
          break;

        case 'mutual_exclusion':
          // At most one of the listed fields may be populated: compare the
          // count of non-null field values (not the field names) against 1
          validationExpressions.push({
            $or: rule.fields.map(field => ({ [field]: { $exists: false } }))
              .concat([
                { $expr: {
                  $lte: [
                    { $size: { $filter: {
                      input: rule.fields.map(field => `$${field}`),
                      cond: { $ne: ['$$this', null] }
                    }}},
                    1
                  ]
                }}
              ])
          });
          break;

        case 'cross_field_validation':
          validationExpressions.push({
            $expr: {
              [rule.operator]: [
                `$${rule.field1}`,
                `$${rule.field2}`
              ]
            }
          });
          break;
      }
    }

    return validationExpressions.length > 0 ? { $and: validationExpressions } : {};
  }
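
  // Example rule set this helper can translate (field names are illustrative):
  // generateBusinessLogicValidation([
  //   { type: 'date_range', field: 'event_date', min: '2020-01-01', max: '2030-12-31' },
  //   { type: 'conditional_required',
  //     condition: { field: 'account_type', value: 'premium' },
  //     requiredField: 'subscription_expires_at' },
  //   { type: 'mutual_exclusion', fields: ['trial_ends_at', 'cancellation_date'] },
  //   { type: 'cross_field_validation', operator: '$lte', field1: 'start_date', field2: 'end_date' }
  // ])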

  async validateDataQuality(collectionName, qualityRules) {
    console.log(`Running data quality validation for ${collectionName}`);

    const collection = this.db.collection(collectionName);
    const qualityReport = {
      collection: collectionName,
      totalDocuments: await collection.countDocuments(),
      qualityIssues: [],
      qualityScore: 0
    };

    for (const rule of qualityRules) {
      const issueCount = await collection.countDocuments(rule.condition);

      if (issueCount > 0) {
        qualityReport.qualityIssues.push({
          rule: rule.name,
          description: rule.description,
          affectedDocuments: issueCount,
          severity: rule.severity,
          suggestion: rule.suggestion
        });
      }
    }

    // Calculate quality score (guard against empty collections)
    const totalIssues = qualityReport.qualityIssues.reduce((sum, issue) => sum + issue.affectedDocuments, 0);
    qualityReport.qualityScore = qualityReport.totalDocuments > 0
      ? Math.max(0, 100 - (totalIssues / qualityReport.totalDocuments * 100))
      : 100;

    return qualityReport;
  }
}
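
The manager above is easiest to see with a concrete call. The following sketch wires AdvancedValidationManager to a versioned orders collection; the connection string, collection name, and schema fields are illustrative assumptions rather than part of the class itself:

// Example usage of AdvancedValidationManager with two coexisting schema versions
const { MongoClient } = require('mongodb');

async function applyVersionedOrderValidation() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  // Assumes the orders collection already exists (collMod modifies an existing collection)
  const validationManager = new AdvancedValidationManager(client.db('ecommerce_platform'));

  // Documents declare a schema_version field and are validated against the matching schema
  await validationManager.implementVersionedValidation('orders', [
    {
      version: 1,
      schema: {
        bsonType: 'object',
        required: ['order_id', 'total_amount'],
        properties: {
          order_id: { bsonType: 'string' },
          total_amount: { bsonType: ['int', 'double', 'decimal'], minimum: 0 }
        }
      }
    },
    {
      version: 2,
      schema: {
        bsonType: 'object',
        required: ['order_id', 'total_amount', 'currency'],
        properties: {
          order_id: { bsonType: 'string' },
          total_amount: { bsonType: ['int', 'double', 'decimal'], minimum: 0 },
          currency: { enum: ['USD', 'EUR', 'GBP'] }
        }
      }
    }
  ]);

  await client.close();
}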

SQL-Style Document Validation with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB document validation and schema management:

-- QueryLeaf document validation with SQL-familiar constraints

-- Create table with comprehensive validation rules
CREATE TABLE users (
  _id ObjectId PRIMARY KEY,
  email VARCHAR(255) NOT NULL UNIQUE 
    CHECK (email REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'),
  username VARCHAR(30) NOT NULL UNIQUE 
    CHECK (username REGEXP '^[a-zA-Z0-9_-]+$' AND LENGTH(username) >= 3),
  password_hash CHAR(60) NOT NULL,

  -- Nested object validation with JSON schema
  profile JSONB NOT NULL CHECK (
    JSON_VALID(profile) AND
    JSON_EXTRACT(profile, '$.first_name') IS NOT NULL AND
    JSON_EXTRACT(profile, '$.last_name') IS NOT NULL AND
    LENGTH(JSON_UNQUOTE(JSON_EXTRACT(profile, '$.first_name'))) >= 1 AND
    LENGTH(JSON_UNQUOTE(JSON_EXTRACT(profile, '$.last_name'))) >= 1
  ),

  -- Complex nested preferences with validation
  preferences JSONB CHECK (
    JSON_VALID(preferences) AND
    JSON_EXTRACT(preferences, '$.notifications.email.frequency') IN ('immediate', 'daily', 'weekly', 'never') AND
    JSON_EXTRACT(preferences, '$.privacy.profile_visibility') IN ('public', 'friends', 'private') AND
    JSON_EXTRACT(preferences, '$.interface.theme') IN ('light', 'dark', 'auto')
  ),

  -- Account information with business logic validation
  account JSONB NOT NULL CHECK (
    JSON_VALID(account) AND
    JSON_EXTRACT(account, '$.status') IN ('active', 'inactive', 'suspended', 'pending') AND
    JSON_EXTRACT(account, '$.type') IN ('free', 'premium', 'enterprise', 'admin') AND
    (
      JSON_EXTRACT(account, '$.type') != 'premium' OR 
      JSON_EXTRACT(account, '$.subscription_expires_at') IS NOT NULL
    )
  ),

  -- Audit timestamps with constraints
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  deleted_at TIMESTAMP NULL,

  -- Complex business logic constraints
  CONSTRAINT valid_birth_date CHECK (
    JSON_EXTRACT(profile, '$.birth_date') IS NULL OR
    JSON_EXTRACT(profile, '$.birth_date') <= CURRENT_DATE
  ),

  CONSTRAINT profile_completeness CHECK (
    (JSON_EXTRACT(account, '$.type') != 'premium') OR
    (
      JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL AND
      JSON_EXTRACT(profile, '$.bio') IS NOT NULL
    )
  ),

  -- Conditional validation based on account type
  CONSTRAINT admin_verification CHECK (
    (JSON_EXTRACT(account, '$.type') != 'admin') OR
    (JSON_EXTRACT(account, '$.verification.identity_verified') = true)
  )
) WITH (
  validation_level = 'strict',
  validation_action = 'error'
);

-- Insert data with comprehensive validation
INSERT INTO users (
  email, username, password_hash, profile, preferences, account
) VALUES (
  'john.doe@example.com',
  'johndoe123', 
  '$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewdBxJzybKlJNcX.',
  JSON_OBJECT(
    'first_name', 'John',
    'last_name', 'Doe',
    'birth_date', '1990-05-15',
    'phone_number', '+1-555-123-4567',
    'bio', 'Software engineer passionate about technology',
    'social_links', JSON_OBJECT(
      'twitter', '@johndoe',
      'github', 'johndoe',
      'linkedin', 'john-doe-dev'
    )
  ),
  JSON_OBJECT(
    'notifications', JSON_OBJECT(
      'email', JSON_OBJECT(
        'marketing', false,
        'security', true,
        'frequency', 'daily'
      ),
      'push', JSON_OBJECT(
        'enabled', true,
        'frequency', 'immediate'
      )
    ),
    'privacy', JSON_OBJECT(
      'profile_visibility', 'friends',
      'search_visibility', true
    ),
    'interface', JSON_OBJECT(
      'theme', 'dark',
      'language', 'en-US',
      'timezone', 'America/New_York'
    )
  ),
  JSON_OBJECT(
    'status', 'active',
    'type', 'free',
    'verification', JSON_OBJECT(
      'email_verified', false,
      'verification_level', 'none'
    ),
    'security', JSON_OBJECT(
      'two_factor_enabled', false,
      'failed_login_attempts', 0
    )
  )
);

-- Advanced validation queries and data quality checks
WITH validation_analysis AS (
  SELECT 
    _id,
    email,
    username,

    -- Profile completeness scoring
    CASE 
      WHEN JSON_EXTRACT(profile, '$.bio') IS NOT NULL 
           AND JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL
           AND JSON_EXTRACT(profile, '$.social_links') IS NOT NULL THEN 100
      WHEN JSON_EXTRACT(profile, '$.bio') IS NOT NULL 
           OR JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL THEN 70
      WHEN JSON_EXTRACT(profile, '$.first_name') IS NOT NULL 
           AND JSON_EXTRACT(profile, '$.last_name') IS NOT NULL THEN 40
      ELSE 20
    END as profile_completeness_score,

    -- Preference configuration analysis
    CASE 
      WHEN JSON_EXTRACT(preferences, '$.notifications') IS NOT NULL
           AND JSON_EXTRACT(preferences, '$.privacy') IS NOT NULL
           AND JSON_EXTRACT(preferences, '$.interface') IS NOT NULL THEN 'complete'
      WHEN JSON_EXTRACT(preferences, '$.notifications') IS NOT NULL THEN 'partial'
      ELSE 'minimal'
    END as preferences_status,

    -- Account validation status
    JSON_EXTRACT(account, '$.status') as account_status,
    JSON_EXTRACT(account, '$.type') as account_type,
    JSON_EXTRACT(account, '$.verification.verification_level') as verification_level,

    -- Data quality flags
    JSON_VALID(profile) as profile_valid,
    JSON_VALID(preferences) as preferences_valid,
    JSON_VALID(account) as account_valid,

    -- Business rule compliance
    CASE 
      WHEN JSON_EXTRACT(account, '$.type') = 'premium' 
           AND JSON_EXTRACT(account, '$.subscription_expires_at') IS NULL THEN false
      ELSE true
    END as subscription_rule_compliant,

    created_at,
    updated_at

  FROM users
  WHERE deleted_at IS NULL
),

data_quality_report AS (
  SELECT 
    COUNT(*) as total_users,

    -- Profile quality metrics
    AVG(profile_completeness_score) as avg_profile_completeness,
    COUNT(*) FILTER (WHERE profile_completeness_score >= 80) as high_quality_profiles,
    COUNT(*) FILTER (WHERE profile_completeness_score < 50) as low_quality_profiles,

    -- Validation compliance
    COUNT(*) FILTER (WHERE profile_valid = false) as invalid_profiles,
    COUNT(*) FILTER (WHERE preferences_valid = false) as invalid_preferences,
    COUNT(*) FILTER (WHERE account_valid = false) as invalid_accounts,

    -- Business rule compliance
    COUNT(*) FILTER (WHERE subscription_rule_compliant = false) as subscription_violations,

    -- Account distribution
    COUNT(*) FILTER (WHERE account_type = 'free') as free_accounts,
    COUNT(*) FILTER (WHERE account_type = 'premium') as premium_accounts,
    COUNT(*) FILTER (WHERE account_type = 'enterprise') as enterprise_accounts,

    -- Verification status
    COUNT(*) FILTER (WHERE verification_level = 'none') as unverified_users,
    COUNT(*) FILTER (WHERE verification_level IN ('email', 'phone', 'identity', 'full')) as verified_users,

    -- Recent activity
    COUNT(*) FILTER (WHERE created_at >= CURRENT_DATE - INTERVAL '30 days') as new_users_30d,
    COUNT(*) FILTER (WHERE updated_at >= CURRENT_DATE - INTERVAL '7 days') as active_users_7d

  FROM validation_analysis
)

SELECT 
  total_users,
  ROUND(avg_profile_completeness, 1) as avg_profile_quality,
  ROUND((high_quality_profiles / total_users::float * 100), 1) as high_quality_pct,
  ROUND((low_quality_profiles / total_users::float * 100), 1) as low_quality_pct,

  -- Data integrity summary
  CASE 
    WHEN (invalid_profiles + invalid_preferences + invalid_accounts) = 0 THEN 'excellent'
    WHEN (invalid_profiles + invalid_preferences + invalid_accounts) < total_users * 0.01 THEN 'good'
    WHEN (invalid_profiles + invalid_preferences + invalid_accounts) < total_users * 0.05 THEN 'acceptable'
    ELSE 'poor'
  END as data_integrity_status,

  -- Business rule compliance
  CASE 
    WHEN subscription_violations = 0 THEN 'compliant'
    WHEN subscription_violations < total_users * 0.01 THEN 'minor_issues'
    ELSE 'major_violations'
  END as business_rule_compliance,

  -- Account distribution summary
  JSON_OBJECT(
    'free', free_accounts,
    'premium', premium_accounts, 
    'enterprise', enterprise_accounts
  ) as account_distribution,

  -- Verification summary
  ROUND((verified_users / total_users::float * 100), 1) as verification_rate_pct,

  -- Growth metrics
  new_users_30d,
  active_users_7d,

  -- Recommendations
  CASE 
    WHEN low_quality_profiles > total_users * 0.3 THEN 'Focus on profile completion campaigns'
    WHEN unverified_users > total_users * 0.5 THEN 'Improve verification processes'
    WHEN subscription_violations > 0 THEN 'Review premium account management'
    ELSE 'Data quality is good'
  END as primary_recommendation

FROM data_quality_report;

-- Schema evolution with validation migration
-- Add new validation rules with backward compatibility
ALTER TABLE users 
ADD CONSTRAINT enhanced_email_validation CHECK (
  email REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$' AND
  email NOT LIKE '%@example.com' AND
  email NOT LIKE '%@test.%' AND
  LENGTH(email) >= 5 AND
  LENGTH(email) <= 254
);

-- Modify existing constraints with migration support
ALTER TABLE users 
MODIFY CONSTRAINT profile_completeness CHECK (
  (JSON_EXTRACT(account, '$.type') NOT IN ('premium', 'enterprise')) OR
  (
    JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL AND
    JSON_EXTRACT(profile, '$.bio') IS NOT NULL AND
    JSON_EXTRACT(profile, '$.social_links') IS NOT NULL
  )
);

-- Add conditional validation based on account age
ALTER TABLE users
ADD CONSTRAINT mature_account_validation CHECK (
  (DATEDIFF(CURRENT_DATE, created_at) < 30) OR
  (
    JSON_EXTRACT(account, '$.verification.email_verified') = true AND
    JSON_EXTRACT(profile, '$.phone_number') IS NOT NULL AND
    JSON_EXTRACT(profile, '$.bio') IS NOT NULL
  )
);

-- Create validation monitoring view
CREATE VIEW user_validation_status AS
SELECT 
  _id,
  email,
  username,
  JSON_EXTRACT(account, '$.status') as status,
  JSON_EXTRACT(account, '$.type') as type,

  -- Validation status flags
  JSON_VALID(profile) as profile_structure_valid,
  JSON_VALID(preferences) as preferences_structure_valid,
  JSON_VALID(account) as account_structure_valid,

  -- Business rule compliance checks
  (
    JSON_EXTRACT(account, '$.type') != 'premium' OR 
    JSON_EXTRACT(account, '$.subscription_expires_at') IS NOT NULL
  ) as subscription_valid,

  (
    JSON_EXTRACT(account, '$.type') != 'admin' OR
    JSON_EXTRACT(account, '$.verification.identity_verified') = true
  ) as admin_verification_valid,

  -- Data completeness assessment  
  CASE 
    WHEN JSON_EXTRACT(profile, '$.first_name') IS NULL THEN 'missing_required_profile_data'
    WHEN JSON_EXTRACT(profile, '$.phone_number') IS NULL 
         AND JSON_EXTRACT(account, '$.type') IN ('premium', 'enterprise') THEN 'incomplete_premium_profile'
    WHEN email NOT REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$' THEN 'invalid_email_format'
    ELSE 'valid'
  END as validation_status,

  created_at,
  updated_at

FROM users
WHERE deleted_at IS NULL;

-- QueryLeaf provides comprehensive document validation capabilities:
-- 1. SQL-familiar constraint syntax with CHECK clauses and business logic
-- 2. JSON validation functions for nested object and array validation  
-- 3. Conditional validation based on field values and account types
-- 4. Complex business rule enforcement through constraint expressions
-- 5. Schema evolution support with backward compatibility options
-- 6. Data quality monitoring and validation status reporting
-- 7. Integration with MongoDB's native document validation features
-- 8. Familiar SQL patterns for constraint management and modification
-- 9. Real-time validation feedback and error handling
-- 10. Comprehensive validation reporting and compliance tracking

Best Practices for Document Validation Implementation

Validation Strategy Design

Essential principles for effective MongoDB document validation:

  1. Progressive Validation: Start with loose validation and progressively tighten rules as data quality improves (see the sketch after this list)
  2. Business Rule Integration: Embed business logic directly into validation rules for consistency
  3. Schema Versioning: Implement versioning strategies for smooth schema evolution
  4. Performance Consideration: Balance validation thoroughness with insertion performance
  5. Error Handling: Design clear, actionable error messages for validation failures
  6. Testing Strategy: Thoroughly test validation rules with edge cases and invalid data
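
As a sketch of progressive validation (item 1 above), the commands below assume a Node.js db handle inside an async context and an existing orders collection; they start with warn-only validation and tighten it to strict errors once the warnings have been addressed:

// Step 1: observe - log violations without rejecting writes
await db.command({
  collMod: 'orders',
  validator: { $jsonSchema: { bsonType: 'object', required: ['order_id', 'total_amount'] } },
  validationLevel: 'moderate',  // skip validation for updates to already-invalid documents
  validationAction: 'warn'      // log violations to the server log instead of failing writes
});

// Step 2: enforce - once warnings have been cleaned up, reject invalid writes
// (the validator set in step 1 is retained; collMod only changes the options passed)
await db.command({
  collMod: 'orders',
  validationLevel: 'strict',
  validationAction: 'error'
});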

Production Implementation

Optimize MongoDB document validation for production environments:

  1. Validation Levels: Use appropriate validation levels (strict, moderate, off) for different environments
  2. Migration Planning: Plan validation changes with proper testing and rollback strategies
  3. Performance Monitoring: Monitor validation impact on write performance and throughput
  4. Data Quality Tracking: Implement comprehensive data quality monitoring and alerting (see the sketch after this list)
  5. Documentation: Maintain clear documentation of validation rules and business logic
  6. Compliance Integration: Align validation rules with regulatory and compliance requirements
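
For data quality tracking (item 4 above), one lightweight approach is to read a collection's active validator and count the documents that no longer satisfy it. This is a sketch assuming a Node.js db handle:

// Count documents that violate the collection's current validator
async function countValidationViolations(db, collectionName) {
  const [info] = await db.listCollections({ name: collectionName }).toArray();
  const validator = info?.options?.validator;
  if (!validator) return { collection: collectionName, violations: 0 };

  // $nor inverts the validator: the query matches only documents that fail it
  const violations = await db.collection(collectionName)
    .countDocuments({ $nor: [validator] });

  return { collection: collectionName, violations };
}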

Conclusion

MongoDB Document Validation provides the perfect balance between schema flexibility and data integrity, enabling applications to evolve rapidly while maintaining data quality and consistency. The powerful validation system supports complex business logic, nested object validation, and gradual schema evolution without the rigid constraints and expensive migrations of traditional relational systems.

Key MongoDB Document Validation benefits include:

  • Flexible Schema Evolution: Modify validation rules without downtime or complex migrations
  • Rich Validation Logic: Support for complex business rules, nested objects, and conditional constraints
  • JSON Schema Standard: Industry-standard validation with MongoDB-specific enhancements
  • Performance Integration: Validation optimizations that work with MongoDB's document architecture
  • Development Agility: Real-time validation feedback that accelerates development cycles
  • Data Quality Assurance: Comprehensive validation reporting and quality monitoring capabilities

Whether you're building user management systems, e-commerce platforms, content management applications, or any system requiring reliable data integrity with flexible schema design, MongoDB Document Validation with QueryLeaf's familiar SQL interface provides the foundation for robust, maintainable data validation.

QueryLeaf Integration: QueryLeaf automatically handles MongoDB document validation while providing SQL-familiar constraint syntax, validation functions, and schema management operations. Complex validation rules, business logic constraints, and data quality monitoring are seamlessly managed through familiar SQL constructs, making sophisticated document validation both powerful and accessible to SQL-oriented development teams.

The combination of flexible document validation with SQL-style operations makes MongoDB an ideal platform for applications requiring both rigorous data integrity and rapid schema evolution, ensuring your applications can adapt to changing requirements while maintaining the highest standards of data quality and consistency.

MongoDB Indexing Strategies and Performance Optimization: Advanced Techniques for High-Performance Database Operations

High-performance database applications depend heavily on strategic indexing to deliver fast query response times, efficient data retrieval, and optimal resource utilization. Poor indexing decisions can lead to slow queries, excessive memory usage, and degraded application performance that becomes increasingly problematic as data volumes grow.

MongoDB's flexible indexing system provides powerful capabilities for optimizing query performance across diverse data patterns and access scenarios. Unlike rigid relational indexing approaches, MongoDB indexes support complex document structures, array fields, geospatial data, and text search, enabling sophisticated optimization strategies that align with modern application requirements while maintaining query performance at scale.

The Traditional Database Indexing Limitations

Conventional relational database indexing approaches have significant constraints for modern application patterns:

-- Traditional PostgreSQL indexing - rigid structure with limited flexibility

-- Basic single-column indexes with limited optimization potential
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_created_at ON users(created_at);
CREATE INDEX idx_users_status ON users(status);
CREATE INDEX idx_users_country ON users(country);

-- Simple compound index with fixed column order
CREATE INDEX idx_users_country_status_created ON users(country, status, created_at);

-- Basic partial index (PostgreSQL specific)
CREATE INDEX idx_active_users_email ON users(email) 
WHERE status = 'active';

-- Limited text search capabilities
CREATE INDEX idx_users_name_fts ON users 
USING GIN(to_tsvector('english', first_name || ' ' || last_name));

-- Complex query with multiple conditions
WITH user_search AS (
  SELECT 
    user_id,
    email,
    first_name,
    last_name,
    status,
    country,
    created_at,
    last_login_at,

    -- Multiple index usage may not be optimal
    CASE 
      WHEN status = 'active' AND last_login_at >= CURRENT_DATE - INTERVAL '30 days' THEN 'active_recent'
      WHEN status = 'active' AND last_login_at < CURRENT_DATE - INTERVAL '30 days' THEN 'active_stale'
      WHEN status = 'inactive' THEN 'inactive'
      ELSE 'pending'
    END as user_category,

    -- Basic scoring for relevance
    CASE country
      WHEN 'US' THEN 3
      WHEN 'CA' THEN 2  
      WHEN 'UK' THEN 2
      ELSE 1
    END as priority_score

  FROM users
  WHERE 
    -- Multiple WHERE conditions that may require different indexes
    status IN ('active', 'inactive') 
    AND country IN ('US', 'CA', 'UK', 'AU', 'DE')
    AND created_at >= CURRENT_DATE - INTERVAL '2 years'
    AND (
      email ILIKE '%@company.com' OR 
      first_name ILIKE 'John%' OR
      last_name ILIKE 'Smith%'
    )
),

user_enrichment AS (
  SELECT 
    us.*,

    -- Subquery requiring additional index
    (SELECT COUNT(*) 
     FROM orders o 
     WHERE o.user_id = us.user_id 
       AND o.created_at >= CURRENT_DATE - INTERVAL '1 year'
    ) as orders_last_year,

    -- Another subquery with different access pattern
    (SELECT SUM(total_amount) 
     FROM orders o 
     WHERE o.user_id = us.user_id 
       AND o.status = 'completed'
    ) as total_spent,

    -- JSON field access (limited optimization)
    preferences->>'theme' as preferred_theme,
    preferences->>'language' as preferred_language,

    -- Array field contains check (poor performance without GIN)
    CASE 
      WHEN tags && ARRAY['premium', 'vip'] THEN true 
      ELSE false 
    END as is_premium_user

  FROM user_search us
),

final_results AS (
  SELECT 
    ue.user_id,
    ue.email,
    ue.first_name,
    ue.last_name,
    ue.status,
    ue.country,
    ue.user_category,
    ue.priority_score,
    ue.orders_last_year,
    ue.total_spent,
    ue.preferred_theme,
    ue.preferred_language,
    ue.is_premium_user,

    -- Complex ranking calculation
    (ue.priority_score * 0.3 + 
     CASE 
       WHEN ue.orders_last_year > 10 THEN 5
       WHEN ue.orders_last_year > 5 THEN 3
       WHEN ue.orders_last_year > 0 THEN 1
       ELSE 0
     END * 0.4 +
     CASE
       WHEN ue.total_spent > 1000 THEN 5
       WHEN ue.total_spent > 500 THEN 3
       WHEN ue.total_spent > 100 THEN 1
       ELSE 0
     END * 0.3
    ) as relevance_score,

    -- Row number for pagination
    ROW_NUMBER() OVER (
      ORDER BY 
        ue.priority_score DESC,
        ue.orders_last_year DESC,
        ue.total_spent DESC,
        ue.created_at DESC
    ) as row_num,

    COUNT(*) OVER () as total_results

  FROM user_enrichment ue
  WHERE ue.orders_last_year > 0 OR ue.total_spent > 50
)

SELECT 
  user_id,
  email,
  first_name || ' ' || last_name as full_name,
  status,
  country,
  user_category,
  orders_last_year,
  ROUND(total_spent::numeric, 2) as total_spent,
  is_premium_user,
  ROUND(relevance_score::numeric, 2) as relevance_score,
  row_num,
  total_results

FROM final_results
WHERE row_num BETWEEN 1 AND 50
ORDER BY relevance_score DESC, row_num ASC;

-- PostgreSQL indexing problems:
-- 1. Fixed column order in compound indexes limits query flexibility
-- 2. Limited support for JSON field indexing and optimization  
-- 3. Poor performance with array field operations and contains queries
-- 4. Complex partial index syntax with limited conditional logic
-- 5. Inefficient handling of multi-field text search scenarios
-- 6. Index maintenance overhead increases significantly with table size
-- 7. Limited support for dynamic query patterns and field combinations
-- 8. Poor integration with application-level data structures
-- 9. Complex index selection logic requires deep database expertise
-- 10. Inflexible index types for specialized data patterns (geo, time-series)

-- Additional index requirements for above query
CREATE INDEX idx_users_compound_search ON users(status, country, created_at) 
WHERE status IN ('active', 'inactive');

CREATE INDEX idx_users_email_pattern ON users(email) 
WHERE email LIKE '%@company.com';

CREATE INDEX idx_users_name_pattern ON users(first_name, last_name) 
WHERE first_name LIKE 'John%' OR last_name LIKE 'Smith%';

CREATE INDEX idx_orders_user_recent ON orders(user_id, created_at) 
WHERE created_at >= CURRENT_DATE - INTERVAL '1 year';

CREATE INDEX idx_orders_user_completed ON orders(user_id, total_amount) 
WHERE status = 'completed';

-- JSON field indexing (limited capabilities)
CREATE INDEX idx_users_preferences_gin ON users USING GIN(preferences);

-- Array field indexing  
CREATE INDEX idx_users_tags_gin ON users USING GIN(tags);

-- MySQL approach (even more limited)
-- Basic indexes only
CREATE INDEX idx_mysql_users_email ON mysql_users(email);
CREATE INDEX idx_mysql_users_status_country ON mysql_users(status, country);
CREATE INDEX idx_mysql_users_created ON mysql_users(created_at);

-- Limited JSON support in older versions
-- ALTER TABLE mysql_users ADD INDEX idx_preferences ((preferences->>'$.theme'));

-- Basic query with limited optimization
SELECT 
  user_id,
  email,
  first_name,
  last_name,
  status,
  country,
  created_at
FROM mysql_users
WHERE status = 'active' 
  AND country IN ('US', 'CA')
  AND created_at >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
ORDER BY created_at DESC
LIMIT 50;

-- MySQL limitations:
-- - Very limited JSON indexing capabilities
-- - No partial indexes or conditional indexing
-- - Basic compound index support with rigid column ordering
-- - Poor performance with complex queries and joins
-- - Limited text search capabilities without additional engines
-- - Minimal support for array operations and specialized data types
-- - Simple index optimization with limited query planner sophistication

MongoDB's advanced indexing system provides comprehensive optimization capabilities:

// MongoDB Advanced Indexing - flexible, powerful, and application-optimized
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('user_analytics_platform');

// Advanced MongoDB indexing strategy manager
class MongoDBIndexingManager {
  constructor(db) {
    this.db = db;
    this.collections = {
      users: db.collection('users'),
      orders: db.collection('orders'),
      products: db.collection('products'),
      analytics: db.collection('analytics'),
      indexMetrics: db.collection('index_metrics')
    };
    this.indexingStrategies = new Map();
    this.performanceTargets = {
      maxQueryTime: 100, // milliseconds
      maxIndexSize: 1024, // MB
      minSelectivity: 0.01 // 1% selectivity threshold
    };
  }

  async createComprehensiveIndexingStrategy() {
    console.log('Creating comprehensive MongoDB indexing strategy...');

    // 1. Single field indexes for basic queries
    await this.createSingleFieldIndexes();

    // 2. Compound indexes for complex multi-field queries
    await this.createCompoundIndexes();

    // 3. Partial indexes for filtered queries
    await this.createPartialIndexes();

    // 4. Text indexes for search functionality
    await this.createTextSearchIndexes();

    // 5. Geospatial indexes for location-based queries
    await this.createGeospatialIndexes();

    // 6. Sparse indexes for optional fields
    await this.createSparseIndexes();

    // 7. TTL indexes for data expiration
    await this.createTTLIndexes();

    // 8. Wildcard indexes for flexible schemas
    await this.createWildcardIndexes();

    console.log('Comprehensive indexing strategy implemented successfully');
  }

  async createSingleFieldIndexes() {
    console.log('Creating optimized single field indexes...');

    const userIndexes = [
      // High-cardinality unique fields
      { email: 1 }, // Unique identifier, high selectivity
      { username: 1 }, // Unique identifier, high selectivity

      // High-frequency filter fields
      { status: 1 }, // Limited values but frequently queried
      { country: 1 }, // Geographic filtering
      { accountType: 1 }, // User segmentation

      // Temporal fields for range queries
      { createdAt: 1 }, // Registration date queries
      { lastLoginAt: 1 }, // Activity-based filtering
      { subscriptionExpiresAt: 1 }, // Subscription management

      // Numerical fields for range and sort operations
      { totalSpent: -1 }, // Customer value analysis (descending)
      { loyaltyPoints: -1 }, // Rewards program queries
      { riskScore: 1 } // Security and fraud detection
    ];

    for (const indexSpec of userIndexes) {
      const fieldName = Object.keys(indexSpec)[0];
      const indexName = `idx_users_${fieldName}`;

      try {
        // Only attach a partial filter when one is defined for this field
        const partialFilter = this.getPartialFilterForField(fieldName);

        await this.collections.users.createIndex(indexSpec, {
          name: indexName,
          background: true,
          ...(partialFilter ? { partialFilterExpression: partialFilter } : {})
        });

        console.log(`Created single field index: ${indexName}`);
        await this.recordIndexMetrics(indexName, 'single_field', indexSpec);

      } catch (error) {
        console.error(`Failed to create index ${indexName}:`, error);
      }
    }

    // Order indexes for e-commerce scenarios
    const orderIndexes = [
      { userId: 1 }, // Customer order lookup
      { status: 1 }, // Order status filtering
      { createdAt: -1 }, // Recent orders first
      { totalAmount: -1 }, // High-value orders
      { paymentStatus: 1 }, // Payment tracking
      { shippingMethod: 1 } // Fulfillment queries
    ];

    for (const indexSpec of orderIndexes) {
      const fieldName = Object.keys(indexSpec)[0];
      const indexName = `idx_orders_${fieldName}`;

      await this.collections.orders.createIndex(indexSpec, {
        name: indexName,
        background: true
      });

      console.log(`Created order index: ${indexName}`);
    }
  }

  async createCompoundIndexes() {
    console.log('Creating optimized compound indexes...');

    // User compound indexes following ESR (Equality, Sort, Range) rule
    const userCompoundIndexes = [
      {
        name: 'idx_users_country_status_created',
        spec: { country: 1, status: 1, createdAt: -1 },
        purpose: 'Geographic user filtering with status and recency',
        queryPatterns: ['country + status filters', 'country + status + date range']
      },
      {
        name: 'idx_users_status_activity_spent',
        spec: { status: 1, lastLoginAt: -1, totalSpent: -1 },
        purpose: 'Active user analysis with spending patterns',
        queryPatterns: ['status + activity analysis', 'customer value segmentation']
      },
      {
        name: 'idx_users_type_tier_points',
        spec: { accountType: 1, loyaltyTier: 1, loyaltyPoints: -1 },
        purpose: 'Customer segmentation and loyalty program queries',
        queryPatterns: ['loyalty program analysis', 'customer tier management']
      },
      {
        name: 'idx_users_email_verification_created',
        spec: { 'verification.email': 1, 'verification.phone': 1, createdAt: -1 },
        purpose: 'User verification status with registration timeline',
        queryPatterns: ['verification status queries', 'onboarding analytics']
      },
      {
        name: 'idx_users_preferences_activity',
        spec: { 'preferences.marketing': 1, 'preferences.notifications': 1, lastLoginAt: -1 },
        purpose: 'Marketing segmentation with activity correlation',
        queryPatterns: ['marketing campaign targeting', 'notification preferences']
      }
    ];

    for (const indexConfig of userCompoundIndexes) {
      try {
        await this.collections.users.createIndex(indexConfig.spec, {
          name: indexConfig.name,
          background: true
        });

        console.log(`Created compound index: ${indexConfig.name}`);
        console.log(`  Purpose: ${indexConfig.purpose}`);
        console.log(`  Query patterns: ${indexConfig.queryPatterns.join(', ')}`);

        await this.recordIndexMetrics(indexConfig.name, 'compound', indexConfig.spec, {
          purpose: indexConfig.purpose,
          queryPatterns: indexConfig.queryPatterns
        });

      } catch (error) {
        console.error(`Failed to create compound index ${indexConfig.name}:`, error);
      }
    }

    // Order compound indexes for e-commerce analytics
    const orderCompoundIndexes = [
      {
        name: 'idx_orders_user_status_date',
        spec: { userId: 1, status: 1, createdAt: -1 },
        purpose: 'Customer order history with status filtering'
      },
      {
        name: 'idx_orders_status_payment_amount',
        spec: { status: 1, paymentStatus: 1, totalAmount: -1 },
        purpose: 'Revenue analysis and payment processing queries'
      },
      {
        name: 'idx_orders_product_date_amount',
        spec: { 'items.productId': 1, createdAt: -1, totalAmount: -1 },
        purpose: 'Product performance analysis with sales trends'
      },
      {
        name: 'idx_orders_shipping_region_date',
        spec: { 'shippingAddress.country': 1, 'shippingAddress.state': 1, createdAt: -1 },
        purpose: 'Geographic sales analysis and shipping optimization'
      }
    ];

    for (const indexConfig of orderCompoundIndexes) {
      await this.collections.orders.createIndex(indexConfig.spec, {
        name: indexConfig.name,
        background: true
      });

      console.log(`Created order compound index: ${indexConfig.name}`);
    }
  }

  async createPartialIndexes() {
    console.log('Creating partial indexes for filtered queries...');

    const partialIndexes = [
      {
        name: 'idx_users_active_email',
        collection: 'users',
        spec: { email: 1 },
        filter: { status: 'active' },
        purpose: 'Active user email lookups (reduces index size by ~70%)'
      },
      {
        name: 'idx_users_premium_spending',
        collection: 'users', 
        spec: { totalSpent: -1, loyaltyPoints: -1 },
        filter: { accountType: 'premium' },
        purpose: 'Premium customer analysis and loyalty tracking'
      },
      {
        name: 'idx_users_recent_active',
        collection: 'users',
        spec: { lastLoginAt: -1, country: 1 },
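        // Note: the $gte cutoff below is evaluated once, when the index is created,
        // and does not slide forward over time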
        filter: { 
          status: 'active',
          lastLoginAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
        },
        purpose: 'Recently active users for engagement campaigns'
      },
      {
        name: 'idx_orders_high_value_completed',
        collection: 'orders',
        spec: { totalAmount: -1, createdAt: -1 },
        filter: { 
          status: 'completed',
          totalAmount: { $gte: 500 }
        },
        purpose: 'High-value completed orders for VIP customer analysis'
      },
      {
        name: 'idx_orders_pending_payment',
        collection: 'orders',
        spec: { createdAt: 1, userId: 1 },
        filter: {
          status: { $in: ['pending', 'processing'] },
          paymentStatus: 'pending'
        },
        purpose: 'Orders requiring payment processing attention'
      },
      {
        name: 'idx_users_verification_required',
        collection: 'users',
        spec: { createdAt: 1, riskScore: -1 },
        filter: {
          $or: [
            { 'verification.email': false },
            { 'verification.phone': false },
            { 'verification.identity': false }
          ]
        },
        purpose: 'Users requiring additional verification steps'
      }
    ];

    for (const partialIndex of partialIndexes) {
      try {
        const collection = this.collections[partialIndex.collection];

        await collection.createIndex(partialIndex.spec, {
          name: partialIndex.name,
          partialFilterExpression: partialIndex.filter,
          background: true
        });

        console.log(`Created partial index: ${partialIndex.name}`);
        console.log(`  Filter: ${JSON.stringify(partialIndex.filter)}`);
        console.log(`  Purpose: ${partialIndex.purpose}`);

        // Measure approximate index size reduction (estimateIndexSize uses the
        // users collection as a baseline, so treat this as a rough indicator)
        const fullIndexStats = await this.estimateIndexSize(partialIndex.spec);
        const partialIndexStats = await collection.aggregate([
          { $match: partialIndex.filter },
          { $count: "documentCount" }
        ]).toArray();

        const totalDocs = fullIndexStats.documentCount || 0;
        const matchingDocs = partialIndexStats[0]?.documentCount || 0;
        const reductionPercent = totalDocs > 0
          ? ((1 - matchingDocs / totalDocs) * 100).toFixed(1)
          : 'n/a';
        console.log(`  Index size reduction: ~${reductionPercent}%`);

      } catch (error) {
        console.error(`Failed to create partial index ${partialIndex.name}:`, error);
      }
    }
  }

  async createTextSearchIndexes() {
    console.log('Creating text search indexes for full-text search...');

    const textIndexes = [
      {
        name: 'idx_users_fulltext_search',
        collection: 'users',
        spec: {
          firstName: 'text',
          lastName: 'text',
          email: 'text',
          'profile.bio': 'text',
          'profile.company': 'text'
        },
        weights: {
          firstName: 10,
          lastName: 10,
          email: 5,
          'profile.bio': 1,
          'profile.company': 3
        },
        purpose: 'Comprehensive user search across name, email, and profile data'
      },
      {
        name: 'idx_products_search',
        collection: 'products',
        spec: {
          name: 'text',
          description: 'text',
          brand: 'text',
          'tags': 'text',
          'specifications.features': 'text'
        },
        weights: {
          name: 20,
          brand: 15,
          tags: 10,
          description: 5,
          'specifications.features': 3
        },
        purpose: 'Product catalog search with relevance weighting'
      },
      {
        name: 'idx_orders_search',
        collection: 'orders',
        spec: {
          orderNumber: 'text',
          'customer.email': 'text',
          'items.productName': 'text',
          'shippingAddress.street': 'text',
          'shippingAddress.city': 'text'
        },
        weights: {
          orderNumber: 20,
          'customer.email': 15,
          'items.productName': 10,
          'shippingAddress.street': 3,
          'shippingAddress.city': 5
        },
        purpose: 'Order search by number, customer, products, or shipping details'
      }
    ];

    for (const textIndex of textIndexes) {
      try {
        const collection = this.collections[textIndex.collection];

        await collection.createIndex(textIndex.spec, {
          name: textIndex.name,
          weights: textIndex.weights,
          background: true,
          // Configure text search options
          default_language: 'english',
          language_override: 'language' // Field name for document language
        });

        console.log(`Created text search index: ${textIndex.name}`);
        console.log(`  Purpose: ${textIndex.purpose}`);
        console.log(`  Weighted fields: ${Object.keys(textIndex.weights).join(', ')}`);

      } catch (error) {
        console.error(`Failed to create text index ${textIndex.name}:`, error);
      }
    }
  }
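
  // Example query against idx_users_fulltext_search (a usage sketch):
  // const matches = await this.collections.users
  //   .find(
  //     { $text: { $search: 'john acme engineer' } },
  //     { projection: { score: { $meta: 'textScore' }, firstName: 1, lastName: 1, email: 1 } }
  //   )
  //   .sort({ score: { $meta: 'textScore' } })
  //   .limit(20)
  //   .toArray();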

  async createGeospatialIndexes() {
    console.log('Creating geospatial indexes for location-based queries...');

    const geoIndexes = [
      {
        name: 'idx_users_location_2dsphere',
        collection: 'users',
        spec: { 'location.coordinates': '2dsphere' },
        purpose: 'User location queries for proximity and regional analysis'
      },
      {
        name: 'idx_orders_shipping_location',
        collection: 'orders',
        spec: { 'shippingAddress.coordinates': '2dsphere' },
        purpose: 'Shipping destination analysis and route optimization'
      },
      {
        name: 'idx_stores_location_2dsphere',
        collection: 'stores',
        spec: { 'address.coordinates': '2dsphere' },
        purpose: 'Store locator and catchment area analysis'
      }
    ];

    for (const geoIndex of geoIndexes) {
      try {
        const collection = this.collections[geoIndex.collection] || this.db.collection(geoIndex.collection);

        await collection.createIndex(geoIndex.spec, {
          name: geoIndex.name,
          background: true,
          // 2dsphere specific options
          '2dsphereIndexVersion': 3 // Use latest version
        });

        console.log(`Created geospatial index: ${geoIndex.name}`);
        console.log(`  Purpose: ${geoIndex.purpose}`);

      } catch (error) {
        console.error(`Failed to create geo index ${geoIndex.name}:`, error);
      }
    }
  }
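
  // Example proximity query using idx_users_location_2dsphere (a usage sketch;
  // coordinates are illustrative and expressed as [longitude, latitude]):
  // const nearbyUsers = await this.collections.users.find({
  //   'location.coordinates': {
  //     $near: {
  //       $geometry: { type: 'Point', coordinates: [-73.9857, 40.7484] },
  //       $maxDistance: 5000 // meters
  //     }
  //   }
  // }).limit(100).toArray();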

  async createSparseIndexes() {
    console.log('Creating sparse indexes for optional fields...');

    const sparseIndexes = [
      {
        name: 'idx_users_social_profiles_sparse',
        collection: 'users',
        spec: { 'socialProfiles.twitter': 1, 'socialProfiles.linkedin': 1 },
        purpose: 'Social media profile lookups (only for users with social profiles)'
      },
      {
        name: 'idx_users_subscription_sparse',
        collection: 'users',
        spec: { 'subscription.planId': 1, 'subscription.renewsAt': 1 },
        purpose: 'Subscription management (only for subscribed users)'
      },
      {
        name: 'idx_users_referral_sparse',
        collection: 'users',
        spec: { 'referral.code': 1, 'referral.referredBy': 1 },
        purpose: 'Referral program tracking (only for users in referral program)'
      },
      {
        name: 'idx_orders_tracking_sparse',
        collection: 'orders',
        spec: { 'shipping.trackingNumber': 1, 'shipping.carrier': 1 },
        purpose: 'Package tracking (only for shipped orders)'
      }
    ];

    for (const sparseIndex of sparseIndexes) {
      try {
        const collection = this.collections[sparseIndex.collection];

        await collection.createIndex(sparseIndex.spec, {
          name: sparseIndex.name,
          sparse: true, // Skip documents where indexed fields are missing
          background: true
        });

        console.log(`Created sparse index: ${sparseIndex.name}`);
        console.log(`  Purpose: ${sparseIndex.purpose}`);

      } catch (error) {
        console.error(`Failed to create sparse index ${sparseIndex.name}:`, error);
      }
    }
  }

  async createTTLIndexes() {
    console.log('Creating TTL indexes for automatic data expiration...');

    const ttlIndexes = [
      {
        name: 'idx_analytics_events_ttl',
        collection: 'analytics',
        spec: { createdAt: 1 },
        expireAfterSeconds: 30 * 24 * 60 * 60, // 30 days
        purpose: 'Automatic cleanup of analytics events after 30 days'
      },
      {
        name: 'idx_user_sessions_ttl',
        collection: 'userSessions',
        spec: { lastActivity: 1 },
        expireAfterSeconds: 7 * 24 * 60 * 60, // 7 days
        purpose: 'Session cleanup after 7 days of inactivity'
      },
      {
        name: 'idx_password_resets_ttl',
        collection: 'passwordResets',
        spec: { createdAt: 1 },
        expireAfterSeconds: 24 * 60 * 60, // 24 hours
        purpose: 'Password reset token expiration after 24 hours'
      },
      {
        name: 'idx_email_verification_ttl',
        collection: 'emailVerifications',
        spec: { createdAt: 1 },
        expireAfterSeconds: 7 * 24 * 60 * 60, // 7 days
        purpose: 'Email verification token cleanup after 7 days'
      }
    ];

    for (const ttlIndex of ttlIndexes) {
      try {
        const collection = this.db.collection(ttlIndex.collection);

        await collection.createIndex(ttlIndex.spec, {
          name: ttlIndex.name,
          expireAfterSeconds: ttlIndex.expireAfterSeconds,
          background: true
        });

        const expireDays = Math.round(ttlIndex.expireAfterSeconds / (24 * 60 * 60));
        console.log(`Created TTL index: ${ttlIndex.name} (expires after ${expireDays} days)`);
        console.log(`  Purpose: ${ttlIndex.purpose}`);

      } catch (error) {
        console.error(`Failed to create TTL index ${ttlIndex.name}:`, error);
      }
    }
  }

  async createWildcardIndexes() {
    console.log('Creating wildcard indexes for flexible schema queries...');

    const wildcardIndexes = [
      {
        name: 'idx_users_metadata_wildcard',
        collection: 'users',
        spec: { 'metadata.$**': 1 },
        purpose: 'Flexible querying of user metadata fields with varying schemas'
      },
      {
        name: 'idx_products_attributes_wildcard',
        collection: 'products',
        spec: { 'attributes.$**': 1 },
        purpose: 'Dynamic product attribute queries for catalog flexibility'
      },
      {
        name: 'idx_orders_customFields_wildcard',
        collection: 'orders',
        spec: { 'customFields.$**': 1 },
        purpose: 'Custom order fields for different business verticals'
      }
    ];

    for (const wildcardIndex of wildcardIndexes) {
      try {
        const collection = this.collections[wildcardIndex.collection] || this.db.collection(wildcardIndex.collection);

        await collection.createIndex(wildcardIndex.spec, {
          name: wildcardIndex.name,
          background: true
          // Note: wildcardProjection is only valid on an all-fields wildcard
          // key ({ '$**': 1 }); path-scoped keys like 'metadata.$**' already
          // restrict which fields are indexed
        });

        console.log(`Created wildcard index: ${wildcardIndex.name}`);
        console.log(`  Purpose: ${wildcardIndex.purpose}`);

      } catch (error) {
        console.error(`Failed to create wildcard index ${wildcardIndex.name}:`, error);
      }
    }
  }
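
  // Example query that can be served by idx_users_metadata_wildcard (sketch;
  // metadata.campaign_source is an illustrative field name):
  // await this.collections.users.find({ 'metadata.campaign_source': 'spring_launch' }).toArray();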

  async performQueryOptimizationAnalysis() {
    console.log('Performing comprehensive query optimization analysis...');

    const analysisResults = {
      slowQueries: [],
      indexUsage: [],
      recommendedIndexes: [],
      performanceMetrics: {}
    };

    // 1. Analyze slow queries from profiler data
    //    (requires database profiling to be enabled, e.g. db.setProfilingLevel(1, { slowms: 100 }))
    const slowQueries = await this.db.collection('system.profile').find({
      ts: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }, // Last 24 hours
      millis: { $gte: 100 } // Queries taking > 100ms
    }).sort({ millis: -1 }).limit(50).toArray();

    analysisResults.slowQueries = slowQueries.map(query => ({
      namespace: query.ns,
      duration: query.millis,
      command: query.command,
      executionStats: query.execStats,
      timestamp: query.ts,
      recommendation: this.generateOptimizationRecommendation(query)
    }));

    // 2. Analyze index usage statistics
    for (const collectionName of Object.keys(this.collections)) {
      const collection = this.collections[collectionName];

      try {
        const indexStats = await collection.aggregate([
          { $indexStats: {} }
        ]).toArray();

        const indexUsage = indexStats.map(stat => ({
          collection: collectionName,
          indexName: stat.name,
          usageCount: stat.accesses.ops,
          lastUsed: stat.accesses.since,
          size: stat.size,
          efficiency: this.calculateIndexEfficiency(stat)
        }));

        analysisResults.indexUsage.push(...indexUsage);

      } catch (error) {
        console.warn(`Could not get index stats for ${collectionName}:`, error.message);
      }
    }

    // 3. Generate index recommendations
    analysisResults.recommendedIndexes = await this.generateIndexRecommendations(analysisResults.slowQueries);

    // 4. Calculate performance metrics
    analysisResults.performanceMetrics = await this.calculatePerformanceMetrics();

    console.log('Query optimization analysis completed');

    // Store analysis results for historical tracking
    await this.collections.indexMetrics.insertOne({
      analysisType: 'query_optimization',
      timestamp: new Date(),
      results: analysisResults
    });

    return analysisResults;
  }

  generateOptimizationRecommendation(slowQuery) {
    const recommendations = [];

    // slowQuery is a raw system.profile document: execStats holds the winning
    // plan tree (execStats.stage), while keysExamined, nreturned, planSummary,
    // and hasSortStage are top-level fields on the profile entry

    // Check for missing indexes based on query pattern
    if (slowQuery.planSummary === 'COLLSCAN' || slowQuery.execStats?.stage === 'COLLSCAN') {
      recommendations.push('Query requires collection scan - consider adding index');
    }

    // Poor selectivity: many index keys examined relative to documents returned
    if (slowQuery.planSummary?.startsWith('IXSCAN') &&
        (slowQuery.keysExamined || 0) > (slowQuery.nreturned || 1) * 10) {
      recommendations.push('Index selectivity is poor - consider compound index or partial index');
    }

    // Check for in-memory sort (sort not satisfied by an index)
    if (slowQuery.command?.sort && slowQuery.hasSortStage) {
      recommendations.push('Sort operation not using index - add sort fields to index');
    }

    // Check for projection optimization
    if (slowQuery.command?.projection && Object.keys(slowQuery.command.projection).length < 5) {
      recommendations.push('Consider covered query with projection fields in index');
    }

    return recommendations.length > 0 ? recommendations : ['Query performance acceptable'];
  }

  calculateIndexEfficiency(indexStat) {
    // Calculate index efficiency based on usage patterns. $indexStats reports
    // usage counters since the index was last loaded via accesses.since; it does
    // not report a creation time, so treat these as best-effort values
    const sizeMb = (indexStat.size || 0) / (1024 * 1024);
    const usage = indexStat.accesses?.ops || 0;
    const sinceMs = indexStat.accesses?.since
      ? Date.now() - new Date(indexStat.accesses.since).getTime()
      : 0;
    const daysTracked = Math.max(sinceMs / (24 * 60 * 60 * 1000), 1);

    // Efficiency metric: operations per day per MB of index
    const efficiency = usage / daysTracked / Math.max(sizeMb, 1);

    return Math.round(efficiency * 100) / 100;
  }

  async generateIndexRecommendations(slowQueries) {
    const recommendations = [];
    const queryPatterns = new Map();

    // Analyze query patterns to suggest indexes
    for (const query of slowQueries) {
      const command = query.command;
      if (!command?.find && !command?.aggregate) continue;

      const collection = query.namespace.split('.')[1];
      const filter = command.find ? command.filter : 
                    command.aggregate?.[0]?.$match;

      if (filter) {
        const pattern = this.extractQueryPattern(filter);
        const key = `${collection}:${pattern}`;

        if (!queryPatterns.has(key)) {
          queryPatterns.set(key, {
            collection,
            pattern,
            frequency: 0,
            avgDuration: 0,
            queries: []
          });
        }

        const patternData = queryPatterns.get(key);
        patternData.frequency++;
        patternData.avgDuration = (patternData.avgDuration * (patternData.frequency - 1) + query.duration) / patternData.frequency;
        patternData.queries.push(query);
      }
    }

    // Generate recommendations based on frequent slow patterns
    for (const [key, patternData] of queryPatterns) {
      if (patternData.frequency >= 3 && patternData.avgDuration >= 100) {
        const recommendedIndex = this.generateIndexSpecFromPattern(patternData.pattern);

        recommendations.push({
          collection: patternData.collection,
          recommendedIndex,
          reason: `Frequent slow queries (${patternData.frequency} occurrences, avg ${patternData.avgDuration}ms)`,
          queryPattern: patternData.pattern,
          estimatedImprovement: this.estimatePerformanceImprovement(patternData)
        });
      }
    }

    return recommendations;
  }

  extractQueryPattern(filter) {
    // Extract query pattern for index recommendation
    const pattern = {};

    for (const [field, condition] of Object.entries(filter)) {
      if (field === '$and' || field === '$or') {
        // Handle logical operators
        pattern[field] = 'logical_operator';
      } else if (typeof condition === 'object' && condition !== null) {
        // Handle range/comparison queries
        const operators = Object.keys(condition);
        if (operators.some(op => ['$gt', '$gte', '$lt', '$lte'].includes(op))) {
          pattern[field] = 'range';
        } else if (operators.includes('$in')) {
          pattern[field] = 'in_list';
        } else if (operators.includes('$regex')) {
          pattern[field] = 'regex';
        } else {
          pattern[field] = 'equality';
        }
      } else {
        pattern[field] = 'equality';
      }
    }

    return JSON.stringify(pattern);
  }

  generateIndexSpecFromPattern(patternStr) {
    const pattern = JSON.parse(patternStr);
    const indexSpec = {};

    // Apply ESR (Equality, Sort, Range) rule
    const equalityFields = [];
    const rangeFields = [];

    for (const [field, type] of Object.entries(pattern)) {
      if (type === 'equality' || type === 'in_list') {
        equalityFields.push(field);
      } else if (type === 'range') {
        rangeFields.push(field);
      }
    }

    // Build index spec: equality fields first, then range fields
    for (const field of equalityFields) {
      indexSpec[field] = 1;
    }
    for (const field of rangeFields) {
      indexSpec[field] = 1;
    }

    return indexSpec;
  }

  estimatePerformanceImprovement(patternData) {
    // Estimate performance improvement based on query characteristics
    const baseImprovement = 50; // Base 50% improvement assumption

    // Higher improvement for collection scans
    if (patternData.queries.some(q => q.executionStats?.stage === 'COLLSCAN')) {
      return Math.min(90, baseImprovement + 30);
    }

    // Moderate improvement for index scans with poor selectivity
    if (patternData.avgDuration > 500) {
      return Math.min(80, baseImprovement + 20);
    }

    return baseImprovement;
  }

  async calculatePerformanceMetrics() {
    const metrics = {};

    try {
      // Get database stats
      const dbStats = await this.db.stats();
      metrics.totalIndexSize = dbStats.indexSize;
      metrics.totalDataSize = dbStats.dataSize;
      metrics.indexToDataRatio = (dbStats.indexSize / dbStats.dataSize * 100).toFixed(1) + '%';

      // Get collection-level metrics
      for (const collectionName of Object.keys(this.collections)) {
        const collection = this.collections[collectionName];
        const stats = await collection.stats();

        metrics[collectionName] = {
          documentCount: stats.count,
          avgDocumentSize: stats.avgObjSize,
          indexCount: stats.nindexes,
          totalIndexSize: stats.totalIndexSize,
          indexSizeRatio: (stats.totalIndexSize / stats.size * 100).toFixed(1) + '%'
        };
      }

    } catch (error) {
      console.warn('Could not calculate all performance metrics:', error.message);
    }

    return metrics;
  }

  async recordIndexMetrics(indexName, indexType, indexSpec, metadata = {}) {
    try {
      await this.collections.indexMetrics.insertOne({
        indexName,
        indexType,
        indexSpec,
        metadata,
        createdAt: new Date(),
        status: 'active'
      });
    } catch (error) {
      console.warn('Failed to record index metrics:', error.message);
    }
  }

  getPartialFilterForField(fieldName) {
    // Return appropriate partial filter expressions for common fields
    const partialFilters = {
      email: { email: { $exists: true, $ne: null } },
      lastLoginAt: { lastLoginAt: { $exists: true } },
      totalSpent: { totalSpent: { $gt: 0 } },
      riskScore: { riskScore: { $exists: true } }
    };

    return partialFilters[fieldName] || null;
  }

  async estimateIndexSize(indexSpec) {
    // Estimate index size based on collection statistics
    try {
      const collection = this.collections.users; // Default to users collection
      const sampleDoc = await collection.findOne();
      const stats = await collection.stats();

      if (sampleDoc && stats) {
        const avgDocSize = stats.avgObjSize;
        const fieldSize = this.estimateFieldSize(sampleDoc, Object.keys(indexSpec));
        const indexOverhead = fieldSize * 1.2; // 20% overhead for B-tree structure

        return {
          documentCount: stats.count,
          estimatedIndexSize: indexOverhead * stats.count,
          avgFieldSize: fieldSize
        };
      }
    } catch (error) {
      console.warn('Could not estimate index size:', error.message);
    }

    return { documentCount: 0, estimatedIndexSize: 0, avgFieldSize: 0 };
  }

  estimateFieldSize(document, fieldNames) {
    let totalSize = 0;

    for (const fieldName of fieldNames) {
      const value = this.getNestedValue(document, fieldName);
      totalSize += this.calculateValueSize(value);
    }

    return totalSize;
  }

  getNestedValue(obj, path) {
    return path.split('.').reduce((current, key) => current?.[key], obj);
  }

  calculateValueSize(value) {
    if (value === null || value === undefined) return 0;
    if (typeof value === 'string') return value.length * 2; // UTF-8 overhead
    if (typeof value === 'number') return 8; // 64-bit numbers
    if (typeof value === 'boolean') return 1;
    if (value instanceof Date) return 8;
    if (Array.isArray(value)) return value.reduce((sum, item) => sum + this.calculateValueSize(item), 0);
    if (typeof value === 'object') return Object.values(value).reduce((sum, val) => sum + this.calculateValueSize(val), 0);

    return 50; // Default estimate for unknown types
  }

  async optimizeExistingIndexes() {
    console.log('Optimizing existing indexes...');

    const optimizationResults = {
      rebuiltIndexes: [],
      droppedIndexes: [],
      recommendations: []
    };

    for (const collectionName of Object.keys(this.collections)) {
      const collection = this.collections[collectionName];

      try {
        // Get current indexes
        const indexes = await collection.indexes();
        const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

        for (const index of indexes) {
          if (index.name === '_id_') continue; // Skip default _id index

          const stat = indexStats.find(s => s.name === index.name);
          const usage = stat?.accesses?.ops || 0;
          const daysSinceCreated = stat ? (Date.now() - stat.accesses.since) / (24 * 60 * 60 * 1000) : 0;

          // Check for unused indexes (no usage in 30 days)
          if (daysSinceCreated > 30 && usage === 0) {
            console.log(`Dropping unused index: ${index.name} in ${collectionName}`);
            await collection.dropIndex(index.name);
            optimizationResults.droppedIndexes.push({
              collection: collectionName,
              indexName: index.name,
              reason: 'Unused for 30+ days'
            });
          }

          // Check for low-efficiency indexes
          const efficiency = stat ? this.calculateIndexEfficiency(stat) : 0;
          if (efficiency < 0.1 && usage > 0) {
            optimizationResults.recommendations.push({
              collection: collectionName,
              indexName: index.name,
              recommendation: 'Low efficiency - consider redesigning or adding partial filter',
              currentEfficiency: efficiency
            });
          }
        }

      } catch (error) {
        console.error(`Error optimizing indexes for ${collectionName}:`, error);
      }
    }

    console.log('Index optimization completed');
    return optimizationResults;
  }
}

// Benefits of MongoDB Advanced Indexing:
// - Flexible compound indexes with optimal field ordering for complex queries
// - Partial indexes that dramatically reduce index size and improve performance
// - Text search indexes with weighted relevance and language support
// - Geospatial indexes for location-based queries and proximity searches
// - Sparse indexes for optional fields that save storage and improve efficiency
// - TTL indexes for automatic data lifecycle management
// - Wildcard indexes for dynamic schema flexibility
// - Real-time index usage analysis and optimization recommendations
// - Integration with query profiler for performance bottleneck identification
// - Sophisticated index strategy management with automated optimization

module.exports = {
  MongoDBIndexingManager
};

Understanding MongoDB Indexing Architecture

Advanced Index Design Patterns and Strategies

Implement sophisticated indexing patterns for optimal query performance:

// Advanced indexing patterns for specialized use cases
class AdvancedIndexingPatterns {
  constructor(db) {
    this.db = db;
    this.performanceTargets = {
      maxQueryTime: 50, // milliseconds for standard queries
      maxComplexQueryTime: 200, // milliseconds for complex analytical queries
      maxIndexSizeRatio: 0.3 // Index size should not exceed 30% of data size
    };
  }

  async implementCoveredQueryOptimization() {
    console.log('Implementing covered query optimization patterns...');

    // Covered queries that can be satisfied entirely from index
    const coveredQueryIndexes = [
      {
        name: 'idx_user_dashboard_covered',
        collection: 'users',
        spec: { 
          status: 1, 
          country: 1, 
          email: 1, 
          firstName: 1, 
          lastName: 1, 
          totalSpent: 1,
          loyaltyPoints: 1,
          createdAt: 1 
        },
        purpose: 'Cover user dashboard queries without document retrieval',
        coveredQueries: [
          'User listing with basic info and spending',
          'Geographic user distribution',
          'Customer segmentation queries'
        ]
      },
      {
        name: 'idx_order_summary_covered',
        collection: 'orders', 
        spec: {
          userId: 1,
          status: 1,
          totalAmount: 1,
          createdAt: 1,
          paymentStatus: 1,
          'shipping.method': 1
        },
        purpose: 'Cover order summary queries for customer service',
        coveredQueries: [
          'Customer order history summaries',
          'Revenue reporting by status and date',
          'Shipping method analysis'
        ]
      }
    ];

    for (const coveredIndex of coveredQueryIndexes) {
      const collection = this.db.collection(coveredIndex.collection);

      await collection.createIndex(coveredIndex.spec, {
        name: coveredIndex.name,
        background: true
      });

      console.log(`Created covered query index: ${coveredIndex.name}`);
      console.log(`  Covered queries: ${coveredIndex.coveredQueries.join(', ')}`);
    }
  }

  async implementHashedIndexingStrategy() {
    console.log('Implementing hashed indexing for sharded collections...');

    // Hashed indexes for even distribution across shards
    const hashedIndexes = [
      {
        name: 'idx_users_id_hashed',
        collection: 'users',
        spec: { _id: 'hashed' },
        purpose: 'Even distribution of users across shards'
      },
      {
        name: 'idx_orders_customer_hashed', 
        collection: 'orders',
        spec: { userId: 'hashed' },
        purpose: 'Distribute customer orders evenly across shards'
      },
      {
        name: 'idx_analytics_session_hashed',
        collection: 'analytics',
        spec: { sessionId: 'hashed' },
        purpose: 'Balance analytics data across sharded cluster'
      }
    ];

    for (const hashedIndex of hashedIndexes) {
      const collection = this.db.collection(hashedIndex.collection);

      await collection.createIndex(hashedIndex.spec, {
        name: hashedIndex.name,
        background: true
      });

      console.log(`Created hashed index: ${hashedIndex.name}`);
    }
  }

  async implementMultikeyIndexOptimization() {
    console.log('Implementing multikey index optimization for arrays...');

    // Optimized indexes for array fields
    const multikeyIndexes = [
      {
        name: 'idx_users_tags_interests',
        collection: 'users',
        spec: { tags: 1, 'interests.category': 1 },
        purpose: 'User segmentation by tags and interest categories'
      },
      {
        name: 'idx_products_categories_brands',
        collection: 'products',
        spec: { categories: 1, brand: 1, status: 1 },
        purpose: 'Product catalog queries with category and brand filtering'
      },
      {
        name: 'idx_orders_product_items',
        collection: 'orders',
        spec: { 'items.productId': 1, 'items.category': 1, status: 1 },
        purpose: 'Product performance analysis across orders'
      }
    ];

    for (const multikeyIndex of multikeyIndexes) {
      const collection = this.db.collection(multikeyIndex.collection);

      // Check if index involves multiple array fields (compound multikey limitation)
      const sampleDoc = await collection.findOne();
      const arrayFields = this.identifyArrayFields(sampleDoc, Object.keys(multikeyIndex.spec));

      if (arrayFields.length > 1) {
        console.warn(`Index ${multikeyIndex.name} may have compound multikey limitations`);
        // Create alternative single-array indexes
        for (const arrayField of arrayFields) {
          const alternativeSpec = { [arrayField]: 1 };
          await collection.createIndex(alternativeSpec, {
            name: `${multikeyIndex.name}_${arrayField}`,
            background: true
          });
        }
      } else {
        await collection.createIndex(multikeyIndex.spec, {
          name: multikeyIndex.name,
          background: true
        });
      }

      console.log(`Created multikey index: ${multikeyIndex.name}`);
    }
  }

  identifyArrayFields(document, fieldNames) {
    const arrayFields = [];

    for (const fieldName of fieldNames) {
      const value = this.getNestedValue(document, fieldName);
      if (Array.isArray(value)) {
        arrayFields.push(fieldName);
      }
    }

    return arrayFields;
  }

  getNestedValue(obj, path) {
    return path.split('.').reduce((current, key) => current?.[key], obj);
  }

  async implementIndexIntersectionStrategies() {
    console.log('Implementing index intersection strategies...');

    // Design indexes that work well together for intersection
    const intersectionIndexes = [
      {
        name: 'idx_users_status_single',
        collection: 'users',
        spec: { status: 1 },
        purpose: 'Status filtering for intersection'
      },
      {
        name: 'idx_users_country_single',
        collection: 'users', 
        spec: { country: 1 },
        purpose: 'Geographic filtering for intersection'
      },
      {
        name: 'idx_users_activity_single',
        collection: 'users',
        spec: { lastLoginAt: -1 },
        purpose: 'Activity-based filtering for intersection'
      },
      {
        name: 'idx_users_spending_single',
        collection: 'users',
        spec: { totalSpent: -1 },
        purpose: 'Spending analysis for intersection'
      }
    ];

    // Create single-field indexes that can be intersected
    for (const index of intersectionIndexes) {
      const collection = this.db.collection(index.collection);

      await collection.createIndex(index.spec, {
        name: index.name,
        background: true
      });

      console.log(`Created intersection index: ${index.name}`);
    }

    // Test intersection performance
    await this.testIndexIntersectionPerformance();
  }

  async testIndexIntersectionPerformance() {
    console.log('Testing index intersection performance...');

    const collection = this.db.collection('users');

    // Query that should use index intersection
    const intersectionQuery = {
      status: 'active',
      country: 'US', 
      lastLoginAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
      totalSpent: { $gte: 100 }
    };

    const explain = await collection.find(intersectionQuery).explain('executionStats');

    if (explain.executionStats.executionStages.stage === 'AND_HASH' ||
        explain.executionStats.executionStages.stage === 'AND_SORTED') {
      console.log('✅ Query successfully using index intersection');
      console.log(`Execution time: ${explain.executionStats.executionTimeMillis}ms`);
    } else {
      console.log('❌ Query not using index intersection, consider compound index');
      console.log(`Current stage: ${explain.executionStats.executionStages.stage}`);
    }
  }

  async implementTimeSeriesIndexing() {
    console.log('Implementing time-series optimized indexing...');

    const timeSeriesIndexes = [
      {
        name: 'idx_metrics_time_metric',
        collection: 'metrics',
        spec: { timestamp: 1, metricType: 1, value: 1 },
        purpose: 'Time-series metrics queries with metric type filtering'
      },
      {
        name: 'idx_events_time_user',
        collection: 'events',
        spec: { timestamp: 1, userId: 1, eventType: 1 },
        purpose: 'User activity timeline and event analysis'
      },
      {
        name: 'idx_logs_time_level',
        collection: 'logs', 
        spec: { timestamp: 1, level: 1, service: 1 },
        purpose: 'Log analysis with severity and service filtering'
      }
    ];

    for (const tsIndex of timeSeriesIndexes) {
      const collection = this.db.collection(tsIndex.collection);

      await collection.createIndex(tsIndex.spec, {
        name: tsIndex.name,
        background: true
      });

      console.log(`Created time-series index: ${tsIndex.name}`);
    }

    // Create time-based partial indexes for recent data
    const recentDataIndexes = [
      {
        name: 'idx_metrics_recent_hot',
        collection: 'metrics',
        spec: { timestamp: 1, metricType: 1, userId: 1 },
        filter: { 
          timestamp: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) }
        },
        purpose: 'Hot data access for recent metrics (last 7 days)'
      },
      {
        name: 'idx_events_recent_active',
        collection: 'events',
        spec: { userId: 1, eventType: 1, timestamp: -1 },
        filter: {
          timestamp: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) }
        },
        purpose: 'Recent user activity (last 24 hours)'
      }
    ];

    for (const recentIndex of recentDataIndexes) {
      const collection = this.db.collection(recentIndex.collection);

      await collection.createIndex(recentIndex.spec, {
        name: recentIndex.name,
        partialFilterExpression: recentIndex.filter,
        background: true
      });

      console.log(`Created recent data index: ${recentIndex.name}`);
    }
  }

  async monitorIndexPerformanceMetrics() {
    console.log('Monitoring index performance metrics...');

    const performanceMetrics = {
      collections: {},
      globalMetrics: {},
      recommendations: []
    };

    for (const collectionName of ['users', 'orders', 'products', 'analytics']) {
      const collection = this.db.collection(collectionName);

      try {
        // Get collection statistics
        const stats = await collection.stats();
        const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();

        performanceMetrics.collections[collectionName] = {
          documentCount: stats.count,
          avgDocumentSize: stats.avgObjSize,
          dataSize: stats.size,
          indexCount: stats.nindexes,
          totalIndexSize: stats.totalIndexSize,
          indexSizeRatio: (stats.totalIndexSize / stats.size).toFixed(3),
          indexes: indexStats.map(stat => {
            // $indexStats does not report per-index size; read it from collStats.indexSizes
            const indexSize = stats.indexSizes?.[stat.name] || 0;
            return {
              name: stat.name,
              size: indexSize,
              usageCount: stat.accesses?.ops || 0,
              lastUsed: stat.accesses?.since,
              efficiency: this.calculateIndexEfficiency({ ...stat, size: indexSize }, stats)
            };
          })
        };

        // Generate recommendations
        const collectionRecommendations = this.generateCollectionIndexRecommendations(
          collectionName, 
          performanceMetrics.collections[collectionName]
        );
        performanceMetrics.recommendations.push(...collectionRecommendations);

      } catch (error) {
        console.warn(`Could not analyze ${collectionName}:`, error.message);
      }
    }

    // Calculate global metrics
    const totalDataSize = Object.values(performanceMetrics.collections)
      .reduce((sum, col) => sum + col.dataSize, 0);
    const totalIndexSize = Object.values(performanceMetrics.collections)
      .reduce((sum, col) => sum + col.totalIndexSize, 0);

    performanceMetrics.globalMetrics = {
      totalDataSize,
      totalIndexSize,
      globalIndexRatio: (totalIndexSize / totalDataSize).toFixed(3),
      totalIndexCount: Object.values(performanceMetrics.collections)
        .reduce((sum, col) => sum + col.indexCount, 0),
      avgIndexEfficiency: this.calculateAverageIndexEfficiency(performanceMetrics.collections)
    };

    console.log('Index performance monitoring completed');
    console.log(`Global index ratio: ${performanceMetrics.globalMetrics.globalIndexRatio}`);
    console.log(`Total indexes: ${performanceMetrics.globalMetrics.totalIndexCount}`);
    console.log(`Recommendations generated: ${performanceMetrics.recommendations.length}`);

    return performanceMetrics;
  }

  calculateIndexEfficiency(indexStat, collectionStats) {
    const usagePerMB = (indexStat.accesses?.ops || 0) / Math.max(indexStat.size / (1024 * 1024), 0.1);
    const sizeRatio = indexStat.size / collectionStats.size;
    const daysSinceLastUse = indexStat.accesses?.since ? 
      (Date.now() - indexStat.accesses.since) / (24 * 60 * 60 * 1000) : 999;

    // Efficiency score: usage frequency weighted by size efficiency and recency
    const efficiencyScore = (usagePerMB * 0.5) + 
                           ((1 - sizeRatio) * 50 * 0.3) + 
                           (Math.max(0, 30 - daysSinceLastUse) * 0.2);

    return Math.round(efficiencyScore * 100) / 100;
  }

  calculateAverageIndexEfficiency(collections) {
    let totalEfficiency = 0;
    let indexCount = 0;

    for (const collection of Object.values(collections)) {
      for (const index of collection.indexes) {
        if (index.name !== '_id_') { // Exclude default _id index
          totalEfficiency += index.efficiency;
          indexCount++;
        }
      }
    }

    return indexCount > 0 ? (totalEfficiency / indexCount).toFixed(2) : 0;
  }

  generateCollectionIndexRecommendations(collectionName, collectionData) {
    const recommendations = [];

    // Check for high index-to-data ratio
    if (parseFloat(collectionData.indexSizeRatio) > this.performanceTargets.maxIndexSizeRatio) {
      recommendations.push({
        collection: collectionName,
        type: 'SIZE_WARNING',
        message: `Index size ratio (${collectionData.indexSizeRatio}) exceeds recommended threshold`,
        suggestion: 'Review index necessity and consider partial indexes'
      });
    }

    // Check for unused indexes
    const unusedIndexes = collectionData.indexes.filter(idx => 
      idx.name !== '_id_' && idx.usageCount === 0
    );

    if (unusedIndexes.length > 0) {
      recommendations.push({
        collection: collectionName,
        type: 'UNUSED_INDEXES',
        message: `Found ${unusedIndexes.length} unused indexes`,
        suggestion: `Consider dropping: ${unusedIndexes.map(idx => idx.name).join(', ')}`
      });
    }

    // Check for low-efficiency indexes
    const inefficientIndexes = collectionData.indexes.filter(idx => 
      idx.name !== '_id_' && idx.efficiency < 1.0
    );

    if (inefficientIndexes.length > 0) {
      recommendations.push({
        collection: collectionName,
        type: 'LOW_EFFICIENCY',
        message: `Found ${inefficientIndexes.length} low-efficiency indexes`,
        suggestion: 'Review usage patterns and consider redesigning or adding partial filters'
      });
    }

    // Check for missing compound indexes (heuristic)
    if (collectionData.indexCount < 3 && collectionData.documentCount > 10000) {
      recommendations.push({
        collection: collectionName,
        type: 'MISSING_COMPOUND_INDEXES',
        message: 'Large collection with few indexes may benefit from compound indexes',
        suggestion: 'Analyze query patterns and create compound indexes for frequently combined filters'
      });
    }

    return recommendations;
  }
}

SQL-Style Index Management with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB index operations:

-- QueryLeaf index management with SQL-familiar syntax

-- Create single-field indexes
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_status ON users(status);
CREATE INDEX idx_users_country ON users(country);
CREATE INDEX idx_users_created_at ON users(created_at DESC); -- Descending sort

-- Create compound indexes following ESR (Equality, Sort, Range) principle
CREATE INDEX idx_users_compound_esr ON users(
  status,           -- Equality: exact match filters
  country,          -- Equality: exact match filters  
  total_spent DESC, -- Sort: ordering field
  created_at        -- Range: range queries
);

-- Create partial indexes with conditions
CREATE INDEX idx_users_active_email ON users(email)
WHERE status = 'active';

CREATE INDEX idx_users_premium_spending ON users(total_spent DESC, loyalty_points DESC)
WHERE account_type = 'premium' AND total_spent > 100;

CREATE INDEX idx_orders_recent_high_value ON orders(total_amount DESC, created_at DESC)
WHERE status = 'completed' 
  AND created_at >= CURRENT_TIMESTAMP - INTERVAL '90 days'
  AND total_amount >= 500;

-- Create text search indexes with weights
CREATE TEXT INDEX idx_users_search ON users(
  first_name WEIGHT 10,
  last_name WEIGHT 10,
  email WEIGHT 5,
  company WEIGHT 3,
  bio WEIGHT 1
) WITH (
  default_language = 'english',
  language_override = 'language'
);

CREATE TEXT INDEX idx_products_search ON products(
  name WEIGHT 20,
  brand WEIGHT 15,
  tags WEIGHT 10,
  description WEIGHT 5,
  features WEIGHT 3
);

-- Create geospatial indexes
CREATE INDEX idx_users_location ON users(location) USING GEO2DSPHERE;
CREATE INDEX idx_stores_address ON stores(address.coordinates) USING GEO2DSPHERE;

-- Create sparse indexes for optional fields
CREATE INDEX idx_users_social_profiles ON users(
  social_profiles.twitter,
  social_profiles.linkedin
) WITH SPARSE;

CREATE INDEX idx_users_subscription ON users(
  subscription.plan_id,
  subscription.expires_at
) WITH SPARSE;

-- Create TTL indexes for automatic data expiration
CREATE INDEX idx_sessions_ttl ON user_sessions(last_activity)
WITH TTL = '7 days';

CREATE INDEX idx_analytics_ttl ON analytics_events(created_at) 
WITH TTL = '30 days';

CREATE INDEX idx_password_resets_ttl ON password_resets(created_at)
WITH TTL = '24 hours';

-- Create wildcard indexes for flexible schemas
CREATE INDEX idx_users_metadata ON users("metadata.$**");
CREATE INDEX idx_products_attributes ON products("attributes.$**");
CREATE INDEX idx_orders_custom_fields ON orders("custom_fields.$**");

-- Advanced compound index patterns
WITH user_activity_analysis AS (
  SELECT 
    user_id,
    status,
    country,
    DATE_TRUNC('month', created_at) as signup_month,
    last_login_at,
    total_spent,
    loyalty_tier,

    -- User categorization
    CASE 
      WHEN total_spent > 1000 THEN 'high_value'
      WHEN total_spent > 100 THEN 'medium_value' 
      ELSE 'low_value'
    END as value_segment,

    CASE
      WHEN last_login_at >= CURRENT_TIMESTAMP - INTERVAL '7 days' THEN 'active'
      WHEN last_login_at >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'recent'
      WHEN last_login_at >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'inactive'
      ELSE 'dormant'
    END as activity_segment

  FROM users
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '2 years'
),

index_optimization_analysis AS (
  SELECT 
    -- Query pattern analysis for index design
    COUNT(*) as total_queries,
    COUNT(*) FILTER (WHERE status = 'active') as active_user_queries,
    COUNT(*) FILTER (WHERE country IN ('US', 'CA', 'UK')) as geographic_queries,
    COUNT(*) FILTER (WHERE total_spent > 100) as spending_queries,
    COUNT(*) FILTER (WHERE last_login_at >= CURRENT_TIMESTAMP - INTERVAL '30 days') as recent_activity_queries,

    -- Compound query patterns
    COUNT(*) FILTER (WHERE status = 'active' AND country = 'US') as status_country_queries,
    COUNT(*) FILTER (WHERE status = 'active' AND total_spent > 100) as status_spending_queries,
    COUNT(*) FILTER (WHERE country = 'US' AND total_spent > 500) as country_spending_queries,

    -- Complex filtering patterns
    COUNT(*) FILTER (
      WHERE status = 'active' 
        AND country IN ('US', 'CA') 
        AND total_spent > 100
        AND last_login_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
    ) as complex_filter_queries,

    -- Sorting patterns (approximated from row data; real sort usage comes from query logs)
    COUNT(*) FILTER (WHERE created_at IS NOT NULL) as date_sort_queries,
    COUNT(*) FILTER (WHERE total_spent IS NOT NULL) as spending_sort_queries,
    COUNT(*) FILTER (WHERE last_login_at IS NOT NULL) as activity_sort_queries,

    -- Range query patterns  
    COUNT(*) FILTER (WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 year') as date_range_queries,
    COUNT(*) FILTER (WHERE total_spent BETWEEN 100 AND 1000) as spending_range_queries

  FROM user_activity_analysis
)

-- Optimal index recommendations based on query patterns
SELECT 
  'CREATE INDEX idx_users_status_country_spending ON users(status, country, total_spent DESC)' as recommended_index,
  'High frequency status + country + spending queries' as justification,
  status_country_queries + country_spending_queries as query_frequency
FROM index_optimization_analysis
WHERE status_country_queries > 100 OR country_spending_queries > 100

UNION ALL

SELECT 
  'CREATE INDEX idx_users_active_recent_spending ON users(status, last_login_at DESC, total_spent DESC) WHERE status = ''active''',
  'Active user analysis with recent activity and spending',
  active_user_queries + recent_activity_queries
FROM index_optimization_analysis  
WHERE active_user_queries > 50

UNION ALL

SELECT 
  'CREATE INDEX idx_users_geographic_value ON users(country, value_segment, activity_segment)',
  'Geographic segmentation with customer value analysis',
  geographic_queries
FROM index_optimization_analysis
WHERE geographic_queries > 75;

-- Index performance monitoring and optimization
WITH index_usage_stats AS (
  SELECT 
    collection_name,
    index_name,
    index_size_mb,
    usage_count,
    last_used,

    -- Calculate index efficiency metrics
    usage_count / GREATEST(index_size_mb, 1) as usage_per_mb,
    EXTRACT(DAY FROM (CURRENT_TIMESTAMP - last_used)) as days_since_last_use,

    -- Index selectivity estimation
    CASE 
      WHEN index_name LIKE '%email%' THEN 'high'      -- Unique fields
      WHEN index_name LIKE '%status%' THEN 'low'      -- Few distinct values
      WHEN index_name LIKE '%country%' THEN 'medium'  -- Geographic distribution
      WHEN index_name LIKE '%created_at%' THEN 'high' -- Timestamp fields
      ELSE 'unknown'
    END as estimated_selectivity,

    -- Index type classification
    CASE 
      WHEN index_name LIKE '%compound%' OR index_name LIKE '%_%_%' THEN 'compound'
      WHEN index_name LIKE '%text%' OR index_name LIKE '%search%' THEN 'text'
      WHEN index_name LIKE '%geo%' OR index_name LIKE '%location%' THEN 'geospatial'
      WHEN index_name LIKE '%ttl%' THEN 'ttl'
      ELSE 'single_field'
    END as index_type

  FROM mongodb_index_stats  -- Hypothetical system table
  WHERE collection_name IN ('users', 'orders', 'products', 'analytics')
),

index_health_assessment AS (
  SELECT 
    collection_name,
    index_name,
    index_type,
    usage_per_mb,
    days_since_last_use,
    estimated_selectivity,

    -- Health score calculation
    CASE 
      WHEN days_since_last_use > 30 AND usage_count = 0 THEN 'UNUSED'
      WHEN usage_per_mb < 0.1 THEN 'LOW_EFFICIENCY' 
      WHEN usage_per_mb > 10 AND estimated_selectivity = 'high' THEN 'OPTIMAL'
      WHEN usage_per_mb > 5 AND estimated_selectivity = 'medium' THEN 'GOOD'
      WHEN usage_per_mb > 1 THEN 'ACCEPTABLE'
      ELSE 'NEEDS_REVIEW'
    END as health_status,

    -- Optimization recommendations
    CASE 
      WHEN days_since_last_use > 30 THEN 'Consider dropping unused index'
      WHEN usage_per_mb < 0.1 AND estimated_selectivity = 'low' THEN 'Add partial filter to improve selectivity'
      WHEN index_type = 'single_field' AND usage_per_mb > 5 THEN 'Consider compound index for better coverage'
      WHEN index_size_mb > 100 AND usage_per_mb < 1 THEN 'Large index with low usage - review necessity'
      ELSE 'Index performing within acceptable parameters'
    END as optimization_recommendation

  FROM index_usage_stats
)

SELECT 
  collection_name,
  index_name,
  index_type,
  health_status,
  ROUND(usage_per_mb, 2) as usage_efficiency,
  days_since_last_use,
  optimization_recommendation,

  -- Priority scoring for optimization
  CASE health_status
    WHEN 'UNUSED' THEN 100
    WHEN 'LOW_EFFICIENCY' THEN 80
    WHEN 'NEEDS_REVIEW' THEN 60
    WHEN 'ACCEPTABLE' THEN 20
    ELSE 0
  END as optimization_priority

FROM index_health_assessment
WHERE health_status != 'OPTIMAL'
ORDER BY optimization_priority DESC, collection_name, index_name;

-- Real-time query performance analysis with index recommendations
WITH slow_queries AS (
  SELECT 
    collection_name,
    query_pattern,
    avg_execution_time_ms,
    query_count,
    index_used,
    documents_examined,
    documents_returned,

    -- Calculate query efficiency metrics  
    documents_examined / GREATEST(documents_returned, 1) as scan_efficiency,
    query_count * avg_execution_time_ms as total_time_impact,

    -- Identify optimization opportunities
    CASE 
      WHEN index_used IS NULL OR index_used = 'COLLSCAN' THEN 'MISSING_INDEX'
      WHEN documents_examined / GREATEST(documents_returned, 1) > 100 THEN 'POOR_SELECTIVITY'
      WHEN avg_execution_time_ms > 100 THEN 'SLOW_QUERY'
      ELSE 'ACCEPTABLE'
    END as performance_issue

  FROM query_performance_log  -- Hypothetical query log table
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
    AND avg_execution_time_ms > 50
),

index_recommendations AS (
  SELECT 
    collection_name,
    query_pattern,
    performance_issue,
    total_time_impact,

    -- Generate specific index recommendations
    CASE performance_issue
      WHEN 'MISSING_INDEX' THEN 
        'CREATE INDEX ON ' || collection_name || ' FOR: ' || query_pattern
      WHEN 'POOR_SELECTIVITY' THEN
        'CREATE PARTIAL INDEX ON ' || collection_name || ' WITH SELECTIVE FILTER'  
      WHEN 'SLOW_QUERY' THEN
        'OPTIMIZE INDEX ON ' || collection_name || ' FOR QUERY: ' || query_pattern
      ELSE 'No immediate action required'
    END as recommended_action,

    -- Estimate performance improvement
    CASE performance_issue
      WHEN 'MISSING_INDEX' THEN LEAST(avg_execution_time_ms * 0.8, 50) -- 80% improvement
      WHEN 'POOR_SELECTIVITY' THEN LEAST(avg_execution_time_ms * 0.6, 30) -- 60% improvement  
      WHEN 'SLOW_QUERY' THEN LEAST(avg_execution_time_ms * 0.4, 20) -- 40% improvement
      ELSE 0
    END as estimated_improvement_ms

  FROM slow_queries
  WHERE performance_issue != 'ACCEPTABLE'
)

SELECT 
  collection_name,
  recommended_action,
  COUNT(*) as affected_query_patterns,
  SUM(total_time_impact) as total_performance_impact,
  ROUND(AVG(estimated_improvement_ms), 1) as avg_improvement_ms,

  -- Calculate ROI for optimization effort
  ROUND(SUM(total_time_impact * estimated_improvement_ms / 1000), 2) as optimization_value_score,

  -- Priority ranking
  ROW_NUMBER() OVER (ORDER BY SUM(total_time_impact) DESC) as optimization_priority

FROM index_recommendations  
GROUP BY collection_name, recommended_action
HAVING COUNT(*) >= 3  -- Focus on patterns affecting multiple queries
ORDER BY optimization_priority ASC;

-- QueryLeaf provides comprehensive index management capabilities:
-- 1. SQL-familiar index creation syntax with advanced options
-- 2. Partial indexes with complex conditional expressions  
-- 3. Text search indexes with customizable weights and language support
-- 4. Geospatial indexing for location-based queries and analysis
-- 5. TTL indexes with flexible expiration rules and time units
-- 6. Compound index optimization following ESR principles
-- 7. Real-time index performance monitoring and health assessment
-- 8. Automated index recommendations based on query patterns
-- 9. Index usage analytics and optimization priority scoring
-- 10. Integration with MongoDB's native indexing optimizations

Best Practices for MongoDB Index Implementation

Index Design Guidelines

Essential principles for optimal MongoDB index design:

  1. ESR Rule: Design compound indexes following Equality, Sort, Range field ordering (see the sketch after this list)
  2. Selectivity Focus: Prioritize high-selectivity fields early in compound indexes
  3. Query Pattern Analysis: Design indexes based on actual application query patterns
  4. Partial Index Usage: Use partial indexes to reduce size and improve performance
  5. Index Intersection: Consider single-field indexes that can be intersected efficiently
  6. Covered Queries: Design indexes to cover frequently executed queries entirely
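
As a minimal sketch of the ESR rule from guideline 1, the hypothetical index below assumes an orders collection queried by exact status and customer, sorted by newest first, and range-filtered on order total; the collection and field names are illustrative rather than part of the schemas used earlier in this article.

// Hypothetical ESR-ordered compound index (names are assumptions)
// Target query shape:
//   { status: 'completed', userId: <id>, totalAmount: { $gte: 100 } }
//   .sort({ createdAt: -1 })
const { MongoClient } = require('mongodb');

async function createEsrIndex(uri) {
  const client = new MongoClient(uri);
  try {
    const orders = client.db('ecommerce').collection('orders');

    await orders.createIndex(
      {
        status: 1,       // Equality: exact-match filter
        userId: 1,       // Equality: exact-match filter
        createdAt: -1,   // Sort: ordering field
        totalAmount: 1   // Range: $gte / $lte filter
      },
      { name: 'idx_orders_esr_example' }
    );
  } finally {
    await client.close();
  }
}

Placing the range field last keeps the index usable for both the equality filters and the sort without an in-memory sort stage.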

Performance and Maintenance

Optimize MongoDB indexes for production workloads:

  1. Regular Monitoring: Implement continuous index usage and performance monitoring (a usage-report sketch follows this list)
  2. Size Management: Keep total index size reasonable relative to data size
  3. Build Impact: Schedule index builds for low-traffic periods; MongoDB 4.2+ uses an optimized build process and ignores the legacy background option
  4. Usage Analysis: Regularly review and remove unused or inefficient indexes
  5. Testing Strategy: Test index changes thoroughly before production deployment
  6. Documentation: Maintain clear documentation of index purpose and query patterns
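
As a rough sketch of the monitoring practice called out in guideline 1, the helper below assumes a connected db handle and prints per-index usage counters from the $indexStats aggregation stage; the reporting logic and thresholds are illustrative only.

// Minimal index-usage report built on $indexStats (thresholds are illustrative)
async function reportIndexUsage(db, collectionName) {
  const stats = await db.collection(collectionName)
    .aggregate([{ $indexStats: {} }])
    .toArray();

  for (const stat of stats) {
    const ops = stat.accesses?.ops ?? 0;
    const since = stat.accesses?.since;
    console.log(`${collectionName}.${stat.name}: ${ops} ops since ${since}`);

    // Flag review candidates rather than dropping anything automatically
    if (stat.name !== '_id_' && ops === 0) {
      console.log('  -> no recorded usage since tracking began; review before dropping');
    }
  }
}

Running this periodically from a scheduled job provides the usage history that the removal and redesign decisions above depend on.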

Conclusion

MongoDB's advanced indexing capabilities provide comprehensive optimization strategies that eliminate the limitations and constraints of traditional relational database indexing approaches. The flexible indexing system supports complex document structures, dynamic schemas, and specialized data types while delivering exceptional query performance at scale.

Key MongoDB Indexing benefits include:

  • Flexible Index Types: Support for compound, partial, text, geospatial, sparse, TTL, and wildcard indexes (illustrated in the sketch after this list)
  • Advanced Query Optimization: Sophisticated query planner with index intersection and covered query support
  • Dynamic Schema Support: Indexing capabilities that adapt to evolving document structures
  • Specialized Data Support: Native indexing for arrays, embedded documents, and geospatial data
  • Performance Analytics: Comprehensive index usage monitoring and optimization recommendations
  • Scalable Architecture: Index strategies that work across replica sets and sharded clusters
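
As a hedged illustration of several of these index types through the Node.js driver, the snippet below assumes hypothetical users and user_sessions collections; it is a sketch of the available options, not this article's production configuration.

// Illustrative index types (collection and field names are assumptions)
async function createExampleIndexes(db) {
  const users = db.collection('users');
  const sessions = db.collection('user_sessions');

  // Partial index: only index premium accounts to keep the index small
  await users.createIndex(
    { totalSpent: -1 },
    { partialFilterExpression: { accountType: 'premium' } }
  );

  // TTL index: expire session documents 7 days after lastActivity
  await sessions.createIndex(
    { lastActivity: 1 },
    { expireAfterSeconds: 7 * 24 * 60 * 60 }
  );

  // Sparse index for an optional field
  await users.createIndex({ 'subscription.planId': 1 }, { sparse: true });

  // Wildcard index over a flexible metadata subdocument
  await users.createIndex({ 'metadata.$**': 1 });

  // 2dsphere index for geospatial queries
  await users.createIndex({ location: '2dsphere' });
}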

Whether you're optimizing query performance, managing large-scale data operations, or building applications with complex data access patterns, MongoDB's indexing system with QueryLeaf's familiar SQL interface provides the foundation for high-performance database operations.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB indexing operations while providing SQL-familiar index creation, optimization, and monitoring syntax. Advanced indexing patterns, performance analysis, and automated recommendations are seamlessly handled through familiar SQL constructs, making sophisticated database optimization both powerful and accessible to SQL-oriented development teams.

The integration of native MongoDB indexing capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both flexible data modeling and familiar database optimization patterns, ensuring your applications achieve optimal performance while remaining maintainable as they scale and evolve.

MongoDB Vector Search for Semantic Applications: Building AI-Powered Search with SQL-Style Vector Operations

Modern applications increasingly require intelligent search capabilities that understand semantic meaning rather than just keyword matching. Traditional text-based search approaches struggle with understanding context, handling synonyms, and providing relevant results for complex queries that require conceptual understanding rather than exact text matches.

MongoDB Atlas Vector Search provides native vector database capabilities that enable semantic similarity search, recommendation systems, and retrieval-augmented generation (RAG) applications. Unlike standalone vector databases that require separate infrastructure, Atlas Vector Search integrates seamlessly with MongoDB's document model, allowing developers to combine traditional database operations with advanced AI-powered search in a single, unified platform.
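
As a minimal sketch of that combination, the pipeline below assumes an Atlas Vector Search index named content_vector_index over a contentVector field and a pre-computed query embedding; the index name, field names, and filter are illustrative assumptions.

// Hedged sketch: vector similarity plus an ordinary document filter in one pipeline
// queryEmbedding is assumed to be an array of floats matching the indexed dimensions
async function semanticSearchSketch(db, queryEmbedding) {
  return db.collection('documents').aggregate([
    {
      $vectorSearch: {
        index: 'content_vector_index',   // assumed Atlas Vector Search index
        path: 'contentVector',
        queryVector: queryEmbedding,
        numCandidates: 200,
        limit: 20
      }
    },
    { $match: { category: 'AI' } },      // conventional filter on the same documents
    {
      $project: {
        title: 1,
        category: 1,
        score: { $meta: 'vectorSearchScore' }  // similarity score from the stage
      }
    }
  ]).toArray();
}

The VectorSearchManager later in this article expands this pattern with filtering inside the $vectorSearch stage, hybrid text search, and personalization.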

The Traditional Search Limitations Challenge

Conventional approaches to search and content discovery have significant limitations for modern intelligent applications:

-- Traditional relational search - limited semantic understanding

-- PostgreSQL full-text search with performance and relevance challenges
CREATE TABLE documents (
  document_id SERIAL PRIMARY KEY,
  title VARCHAR(500) NOT NULL,
  content TEXT NOT NULL,
  category VARCHAR(100),
  tags TEXT[],
  author VARCHAR(200),
  created_at TIMESTAMP DEFAULT NOW(),

  -- Full-text search vector (keyword-based only)
  search_vector tsvector GENERATED ALWAYS AS (
    setweight(to_tsvector('english', title), 'A') ||
    setweight(to_tsvector('english', content), 'B') ||
    setweight(to_tsvector('english', array_to_string(tags, ' ')), 'C')
  ) STORED
);

-- Create full-text search index
CREATE INDEX idx_documents_fts ON documents USING GIN(search_vector);

-- Additional indexes for filtering
CREATE INDEX idx_documents_category ON documents(category);
CREATE INDEX idx_documents_created_at ON documents(created_at DESC);
CREATE INDEX idx_documents_author ON documents(author);

-- Traditional keyword-based search with limited semantic understanding
WITH search_query AS (
  SELECT 
    document_id,
    title,
    content,
    category,
    author,
    created_at,

    -- Basic relevance scoring (keyword-based only)
    ts_rank_cd(search_vector, plainto_tsquery('english', 'machine learning algorithms')) as relevance_score,

    -- Highlight matching text
    ts_headline('english', content, plainto_tsquery('english', 'machine learning algorithms'), 
                'MaxWords=50, MinWords=20, ShortWord=3, HighlightAll=false') as highlighted_content,

    -- Basic similarity using trigram matching (very limited)
    similarity(title, 'machine learning algorithms') as title_similarity,

    -- Category boosting (manual relevance adjustment)
    CASE category 
      WHEN 'AI' THEN 1.5 
      WHEN 'Technology' THEN 1.2 
      ELSE 1.0 
    END as category_boost

  FROM documents
  WHERE search_vector @@ plainto_tsquery('english', 'machine learning algorithms')
     OR similarity(title, 'machine learning algorithms') > 0.1
),

ranked_results AS (
  SELECT 
    *,
    -- Combined relevance scoring (still keyword-dependent)
    (relevance_score * category_boost * 
     CASE WHEN title_similarity > 0.3 THEN 2.0 ELSE 1.0 END) as final_score,

    -- Manual semantic grouping (limited effectiveness)
    CASE 
      WHEN content ILIKE '%neural network%' OR content ILIKE '%deep learning%' THEN 'Deep Learning'
      WHEN content ILIKE '%statistics%' OR content ILIKE '%data science%' THEN 'Data Science' 
      WHEN content ILIKE '%algorithm%' OR content ILIKE '%optimization%' THEN 'Algorithms'
      ELSE 'General'
    END as semantic_category,

    -- Time decay factor
    CASE 
      WHEN created_at >= NOW() - INTERVAL '30 days' THEN 1.2
      WHEN created_at >= NOW() - INTERVAL '90 days' THEN 1.0
      WHEN created_at >= NOW() - INTERVAL '1 year' THEN 0.8
      ELSE 0.6
    END as recency_boost

  FROM search_query
  WHERE relevance_score > 0.01
),

related_documents AS (
  -- Attempt to find related documents (very basic approach)
  SELECT DISTINCT
    r1.document_id,
    r2.document_id as related_id,
    r2.title as related_title,

    -- Basic relatedness calculation
    (array_length(array(SELECT UNNEST(r1.tags) INTERSECT SELECT UNNEST(r2.tags)), 1) / 
     GREATEST(array_length(r1.tags, 1), array_length(r2.tags, 1))::numeric) as tag_similarity,

    CASE WHEN r1.category = r2.category THEN 0.3 ELSE 0 END as category_match,
    CASE WHEN r1.author = r2.author THEN 0.2 ELSE 0 END as author_match

  FROM ranked_results r1
  JOIN documents r2 ON r1.document_id != r2.document_id
  WHERE r1.final_score > 0.5
),

final_results AS (
  SELECT 
    r.document_id,
    r.title,
    LEFT(r.content, 200) || '...' as content_preview,
    r.highlighted_content,
    r.category,
    r.semantic_category,
    r.author,
    r.created_at,

    -- Final ranking with all factors
    ROUND((r.final_score * r.recency_boost)::numeric, 4) as final_relevance_score,

    -- Related documents (limited by keyword overlap)
    COALESCE(
      (SELECT json_agg(json_build_object(
        'id', related_id,
        'title', related_title,
        'similarity', ROUND((tag_similarity + category_match + author_match)::numeric, 3)
      )) FROM related_documents rd 
       WHERE rd.document_id = r.document_id 
         AND (tag_similarity + category_match + author_match) > 0.1
       LIMIT 5),
      '[]'::json
    ) as related_documents

  FROM ranked_results r
)

SELECT 
  document_id,
  title,
  content_preview,
  highlighted_content,
  category,
  semantic_category,
  author,
  final_relevance_score,
  related_documents,

  -- Search result metadata
  COUNT(*) OVER () as total_results,
  ROW_NUMBER() OVER (ORDER BY final_relevance_score DESC) as result_rank

FROM final_results
ORDER BY final_relevance_score DESC, created_at DESC
LIMIT 20;

-- Problems with traditional keyword-based search:
-- 1. No understanding of semantic meaning or context
-- 2. Cannot handle synonyms, related concepts, or conceptual queries
-- 3. Limited relevance scoring based only on keyword frequency and position  
-- 4. Poor handling of multilingual content and cross-language search
-- 5. No support for similarity search across different content types
-- 6. Manual and error-prone relevance tuning with limited effectiveness
-- 7. Cannot understand user intent beyond explicit keyword matches
-- 8. Poor recommendation capabilities based only on metadata overlap
-- 9. Limited support for complex search patterns and AI-powered features
-- 10. No integration with modern machine learning and embedding models

-- MySQL approach (even more limited)
SELECT 
  document_id,
  title,
  content,
  category,

  -- Basic full-text search (MySQL limitations)
  MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE) as relevance,

  -- Simple keyword highlighting
  REPLACE(
    REPLACE(title, 'machine', '<mark>machine</mark>'), 
    'learning', '<mark>learning</mark>'
  ) as highlighted_title

FROM mysql_documents
WHERE MATCH(title, content) AGAINST('machine learning' IN NATURAL LANGUAGE MODE)
ORDER BY relevance DESC
LIMIT 10;

-- MySQL limitations:
-- - Very basic full-text search with limited relevance algorithms
-- - No semantic understanding or contextual matching
-- - Limited text processing and language support
-- - Basic relevance scoring without advanced ranking factors
-- - No support for vector embeddings or similarity search
-- - Limited customization of search behavior and ranking
-- - Poor performance with large text corpuses
-- - No integration with modern AI/ML search techniques

MongoDB Atlas Vector Search provides intelligent semantic search capabilities:

// MongoDB Atlas Vector Search - AI-powered semantic search and similarity matching
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb+srv://your-cluster.mongodb.net/');
const db = client.db('intelligent_search_platform');

// Advanced vector search and semantic similarity platform
class VectorSearchManager {
  constructor(db) {
    this.db = db;
    this.collections = {
      documents: db.collection('documents'),
      vectorIndex: db.collection('vector_index_metadata'),
      searchAnalytics: db.collection('search_analytics'),
      userProfiles: db.collection('user_profiles'),
      recommendations: db.collection('recommendations')
    };

    // Vector search configuration
    this.vectorConfig = {
      dimensions: 1536, // OpenAI text-embedding-ada-002
      similarity: 'cosine',
      indexType: 'knnVector'
    };

    this.embeddingModel = 'text-embedding-ada-002'; // Can be configured for different models
  }

  async initializeVectorSearchIndexes() {
    console.log('Initializing Atlas Vector Search indexes...');

    // Create vector search index for document content
    const contentVectorIndex = {
      name: 'content_vector_index',
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'contentVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          },
          {
            type: 'filter',
            path: 'category'
          },
          {
            type: 'filter', 
            path: 'tags'
          },
          {
            type: 'filter',
            path: 'publishedDate'
          },
          {
            type: 'filter',
            path: 'author'
          },
          {
            type: 'filter',
            path: 'contentType'
          }
        ]
      }
    };

    // Create vector search index for title embeddings
    const titleVectorIndex = {
      name: 'title_vector_index', 
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'titleVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          }
        ]
      }
    };

    // Create hybrid search index combining vector and text search
    const hybridSearchIndex = {
      name: 'hybrid_search_index',
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'contentVector',
            numDimensions: this.vectorConfig.dimensions,
            similarity: this.vectorConfig.similarity
          },
          {
            type: 'autocomplete',
            path: 'title',
            tokenization: 'edgeGram',
            minGrams: 2,
            maxGrams: 15
          },
          {
            type: 'text',
            path: 'content',
            analyzer: 'lucene.standard'
          },
          {
            type: 'text',
            path: 'tags',
            analyzer: 'lucene.keyword'
          }
        ]
      }
    };

    try {
      // Note: In practice, vector search indexes are created through MongoDB Atlas UI
      // or MongoDB CLI. This code shows the structure for reference.
      console.log('Vector search indexes configured:');
      console.log('- Content Vector Index:', contentVectorIndex.name);
      console.log('- Title Vector Index:', titleVectorIndex.name); 
      console.log('- Hybrid Search Index:', hybridSearchIndex.name);

      // Store index metadata for application reference
      await this.collections.vectorIndex.insertMany([
        { ...contentVectorIndex, createdAt: new Date(), status: 'active' },
        { ...titleVectorIndex, createdAt: new Date(), status: 'active' },
        { ...hybridSearchIndex, createdAt: new Date(), status: 'active' }
      ]);

      return {
        contentVectorIndex: contentVectorIndex.name,
        titleVectorIndex: titleVectorIndex.name,
        hybridSearchIndex: hybridSearchIndex.name
      };

    } catch (error) {
      console.error('Vector index initialization failed:', error);
      throw error;
    }
  }

  async ingestDocumentsWithVectorization(documents) {
    console.log(`Processing ${documents.length} documents for vector search ingestion...`);

    const processedDocuments = [];
    const batchSize = 10;

    // Process documents in batches to manage API rate limits
    for (let i = 0; i < documents.length; i += batchSize) {
      const batch = documents.slice(i, i + batchSize);

      console.log(`Processing batch ${Math.floor(i / batchSize) + 1}/${Math.ceil(documents.length / batchSize)}`);

      const batchPromises = batch.map(async (doc) => {
        try {
          // Generate embeddings for title and content
          const [titleEmbedding, contentEmbedding] = await Promise.all([
            this.generateEmbedding(doc.title),
            this.generateEmbedding(doc.content)
          ]);

          // Extract key phrases and entities for enhanced searchability
          const extractedEntities = await this.extractEntities(doc.content);
          const keyPhrases = await this.extractKeyPhrases(doc.content);

          // Calculate content characteristics for better matching
          const contentCharacteristics = this.analyzeContentCharacteristics(doc.content);

          return {
            _id: doc._id || new ObjectId(),

            // Original document content
            title: doc.title,
            content: doc.content,
            summary: doc.summary || this.generateSummary(doc.content),

            // Document metadata
            category: doc.category,
            tags: doc.tags || [],
            author: doc.author,
            publishedDate: doc.publishedDate || new Date(),
            contentType: doc.contentType || 'article',
            language: doc.language || 'en',

            // Vector embeddings for semantic search
            titleVector: titleEmbedding,
            contentVector: contentEmbedding,

            // Enhanced searchability features
            entities: extractedEntities,
            keyPhrases: keyPhrases,
            contentCharacteristics: contentCharacteristics,

            // Search optimization metadata
            searchMetadata: {
              wordCount: doc.content.split(/\s+/).length,
              readingTime: Math.ceil(doc.content.split(/\s+/).length / 200), // minutes
              complexity: contentCharacteristics.complexity,
              topicDistribution: contentCharacteristics.topics,
              sentimentScore: contentCharacteristics.sentiment
            },

            // Document quality and authority signals
            qualitySignals: {
              authorityScore: doc.authorityScore || 0.5,
              freshnessScore: this.calculateFreshnessScore(doc.publishedDate || new Date()),
              engagementScore: doc.engagementScore || 0.5,
              accuracyScore: doc.accuracyScore || 0.8
            },

            // Indexing and processing metadata
            indexed: true,
            indexedAt: new Date(),
            vectorModelVersion: this.embeddingModel,
            processingVersion: '1.0'
          };

        } catch (error) {
          console.error(`Failed to process document ${doc._id}:`, error);
          return null;
        }
      });

      const batchResults = await Promise.all(batchPromises);
      const validResults = batchResults.filter(result => result !== null);
      processedDocuments.push(...validResults);

      // Rate limiting pause between batches
      if (i + batchSize < documents.length) {
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    }

    // Bulk insert processed documents
    if (processedDocuments.length > 0) {
      const insertResult = await this.collections.documents.insertMany(processedDocuments, {
        ordered: false
      });

      console.log(`Successfully indexed ${insertResult.insertedCount} documents with vector embeddings`);

      return {
        totalProcessed: documents.length,
        successfullyIndexed: insertResult.insertedCount,
        failed: documents.length - processedDocuments.length,
        indexedDocuments: processedDocuments
      };
    }

    return {
      totalProcessed: documents.length,
      successfullyIndexed: 0,
      failed: documents.length,
      indexedDocuments: []
    };
  }

  async performSemanticSearch(query, options = {}) {
    console.log(`Performing semantic search for: "${query}"`);

    const {
      limit = 20,
      filters = {},
      includeScore = true,
      similarityThreshold = 0.7,
      searchType = 'semantic', // 'semantic', 'hybrid', 'keyword'
      userContext = null
    } = options;

    try {
      // Generate query embedding for semantic search
      const queryEmbedding = await this.generateEmbedding(query);

      let pipeline = [];

      if (searchType === 'semantic' || searchType === 'hybrid') {
        // Vector similarity search stage
        pipeline.push({
          $vectorSearch: {
            index: 'content_vector_index',
            path: 'contentVector',
            queryVector: queryEmbedding,
            numCandidates: limit * 10, // Search more candidates for better results
            limit: limit * 2, // Get more results for reranking
            filter: this.buildFilterExpression(filters)
          }
        });

        // Add vector search score
        pipeline.push({
          $addFields: {
            vectorScore: { $meta: 'vectorSearchScore' },
            searchMethod: 'vector'
          }
        });
      }

      if (searchType === 'hybrid') {
        // Combine with text search for hybrid approach
        pipeline.push({
          $unionWith: {
            coll: 'documents',
            pipeline: [
              {
                $search: {
                  index: 'hybrid_search_index',
                  compound: {
                    should: [
                      {
                        text: {
                          query: query,
                          path: ['title', 'content'],
                          score: { boost: { value: 2.0 } }
                        }
                      },
                      {
                        autocomplete: {
                          query: query,
                          path: 'title',
                          score: { boost: { value: 1.5 } }
                        }
                      }
                    ],
                    filter: this.buildSearchFilterClauses(filters)
                  }
                }
              },
              {
                $addFields: {
                  textScore: { $meta: 'searchScore' },
                  searchMethod: 'text'
                }
              },
              { $limit: limit }
            ]
          }
        });
      }

      // Enhanced result processing and ranking
      pipeline.push({
        $addFields: {
          // Calculate comprehensive relevance score
          relevanceScore: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$searchMethod', 'vector'] },
                  then: {
                    $multiply: [
                      { $ifNull: ['$vectorScore', 0] },
                      { $add: [
                        { $multiply: [{ $ifNull: ['$qualitySignals.authorityScore', 0.5] }, 0.2] },
                        { $multiply: [{ $ifNull: ['$qualitySignals.freshnessScore', 0.5] }, 0.1] },
                        { $multiply: [{ $ifNull: ['$qualitySignals.engagementScore', 0.5] }, 0.15] },
                        0.55 // Base score weight
                      ]}
                    ]
                  }
                },
                {
                  case: { $eq: ['$searchMethod', 'text'] },
                  then: {
                    $multiply: [
                      { $ifNull: ['$textScore', 0] },
                      0.8 // Weight text search lower than semantic
                    ]
                  }
                }
              ],
              default: 0
            }
          },

          // Extract relevant snippets
          contentSnippet: {
            $substrCP: [
              '$content', 
              0, 
              300
            ]
          },

          // Calculate query-document semantic similarity
          semanticRelevance: {
            $cond: {
              if: { $gt: [{ $ifNull: ['$vectorScore', 0] }, similarityThreshold] },
              then: 'high',
              else: {
                $cond: {
                  if: { $gt: [{ $ifNull: ['$vectorScore', 0] }, similarityThreshold * 0.8] },
                  then: 'medium',
                  else: 'low'
                }
              }
            }
          }
        }
      });

      // User personalization if context provided
      if (userContext) {
        pipeline.push({
          $addFields: {
            personalizedScore: {
              $multiply: [
                '$relevanceScore',
                {
                  $add: [
                    // Category preference boost
                    {
                      $cond: {
                        if: { $in: ['$category', userContext.preferredCategories || []] },
                        then: 0.2,
                        else: 0
                      }
                    },
                    // Author preference boost  
                    {
                      $cond: {
                        if: { $in: ['$author', userContext.followedAuthors || []] },
                        then: 0.15,
                        else: 0
                      }
                    },
                    // Language preference
                    {
                      $cond: {
                        if: { $eq: ['$language', userContext.preferredLanguage || 'en'] },
                        then: 0.1,
                        else: -0.05
                      }
                    },
                    1.0 // Base multiplier
                  ]
                }
              ]
            }
          }
        });
      }

      // Filter by similarity threshold and finalize results
      pipeline.push(
        {
          $match: {
            relevanceScore: { $gte: similarityThreshold * 0.5 }
          }
        },
        {
          $sort: {
            [userContext ? 'personalizedScore' : 'relevanceScore']: -1,
            publishedDate: -1
          }
        },
        {
          $limit: limit
        },
        {
          $project: {
            _id: 1,
            title: 1,
            contentSnippet: 1,
            category: 1,
            tags: 1,
            author: 1,
            publishedDate: 1,
            contentType: 1,
            language: 1,
            entities: 1,
            keyPhrases: 1,
            searchMetadata: 1,
            // Score fields are added only when requested; a projection cannot
            // mix inclusions (1) and exclusions (0) except for _id
            ...(includeScore ? { relevanceScore: 1, vectorScore: 1, textScore: 1 } : {}),
            ...(includeScore && userContext ? { personalizedScore: 1 } : {}),
            semanticRelevance: 1,
            searchMethod: 1
          }
        }
      );

      const searchStart = Date.now();
      const results = await this.collections.documents.aggregate(pipeline).toArray();
      const searchTime = Date.now() - searchStart;

      // Log search analytics
      await this.logSearchAnalytics({
        query: query,
        searchType: searchType,
        filters: filters,
        resultCount: results.length,
        searchTime: searchTime,
        userContext: userContext,
        timestamp: new Date()
      });

      console.log(`Semantic search completed in ${searchTime}ms, found ${results.length} results`);

      return {
        query: query,
        searchType: searchType,
        results: results,
        metadata: {
          totalResults: results.length,
          searchTime: searchTime,
          similarityThreshold: similarityThreshold,
          filtersApplied: Object.keys(filters).length > 0
        }
      };

    } catch (error) {
      console.error('Semantic search failed:', error);
      throw error;
    }
  }

  async findSimilarDocuments(documentId, options = {}) {
    console.log(`Finding documents similar to: ${documentId}`);

    const {
      limit = 10,
      similarityThreshold = 0.75,
      excludeCategories = [],
      includeScore = true
    } = options;

    // Get the source document and its vector
    const sourceDocument = await this.collections.documents.findOne(
      { _id: documentId },
      { projection: { contentVector: 1, title: 1, category: 1, tags: 1 } }
    );

    if (!sourceDocument || !sourceDocument.contentVector) {
      throw new Error('Source document not found or not vectorized');
    }

    // Find similar documents using vector search
    const pipeline = [
      {
        $vectorSearch: {
          index: 'content_vector_index',
          path: 'contentVector',
          queryVector: sourceDocument.contentVector,
          numCandidates: limit * 20,
          limit: limit * 2,
          filter: {
            $and: [
              { _id: { $ne: documentId } }, // Exclude source document
              // $vectorSearch filters support $nin; an empty exclusion list matches everything
              { category: { $nin: excludeCategories } }
            ]
          }
        }
      },
      {
        $addFields: {
          similarityScore: { $meta: 'vectorSearchScore' },

          // Calculate additional similarity factors
          tagSimilarity: {
            $let: {
              vars: {
                commonTags: {
                  $size: {
                    $setIntersection: ['$tags', sourceDocument.tags || []]
                  }
                },
                totalTags: {
                  $add: [
                    { $size: { $ifNull: ['$tags', []] } },
                    { $size: { $ifNull: [sourceDocument.tags, []] } }
                  ]
                }
              },
              in: {
                $cond: {
                  if: { $gt: ['$$totalTags', 0] },
                  then: { $divide: ['$$commonTags', '$$totalTags'] },
                  else: 0
                }
              }
            }
          },

          categorySimilarity: {
            $cond: {
              if: { $eq: ['$category', sourceDocument.category] },
              then: 0.2,
              else: 0
            }
          }
        }
      },
      {
        $addFields: {
          combinedSimilarity: {
            $add: [
              { $multiply: ['$similarityScore', 0.7] },
              { $multiply: ['$tagSimilarity', 0.2] },
              '$categorySimilarity'
            ]
          }
        }
      },
      {
        $match: {
          combinedSimilarity: { $gte: similarityThreshold }
        }
      },
      {
        $sort: { combinedSimilarity: -1 }
      },
      {
        $limit: limit
      },
      {
        $project: {
          _id: 1,
          title: 1,
          contentSnippet: { $substrCP: ['$content', 0, 200] },
          category: 1,
          tags: 1,
          author: 1,
          publishedDate: 1,
          // Score fields are added only when requested (projections cannot mix 0 and 1)
          ...(includeScore ? { similarityScore: 1, combinedSimilarity: 1 } : {}),
          searchMetadata: 1
        }
      }
    ];

    const similarDocuments = await this.collections.documents.aggregate(pipeline).toArray();

    return {
      sourceDocumentId: documentId,
      sourceTitle: sourceDocument.title,
      similarDocuments: similarDocuments,
      metadata: {
        totalSimilar: similarDocuments.length,
        similarityThreshold: similarityThreshold,
        searchMethod: 'vector_similarity'
      }
    };
  }

  async generateRecommendations(userId, options = {}) {
    console.log(`Generating personalized recommendations for user: ${userId}`);

    const {
      limit = 15,
      diversityFactor = 0.3,
      includeExplanations = true
    } = options;

    // Get user profile and interaction history
    const userProfile = await this.collections.userProfiles.findOne({ userId: userId });

    if (!userProfile) {
      console.log('User profile not found, using general recommendations');
      return this.generateGeneralRecommendations(limit);
    }

    // Build user preference vector from interaction history
    const userVector = await this.buildUserPreferenceVector(userProfile);

    if (!userVector) {
      return this.generateGeneralRecommendations(limit);
    }

    // Find documents matching user preferences
    const pipeline = [
      {
        $vectorSearch: {
          index: 'content_vector_index',
          path: 'contentVector',
          queryVector: userVector,
          numCandidates: limit * 10,
          limit: limit * 3,
          filter: {
            $and: [
              // Exclude already read documents ($nin is supported in $vectorSearch filters)
              { _id: { $nin: userProfile.readDocuments || [] } },

              // Include preferred categories; fall back to a match-all clause
              // because empty documents are not valid inside the filter's $and
              userProfile.preferredCategories && userProfile.preferredCategories.length > 0
                ? { category: { $in: userProfile.preferredCategories } }
                : { _id: { $nin: [] } },

              // Fresh content preference
              {
                publishedDate: {
                  $gte: new Date(Date.now() - 90 * 24 * 60 * 60 * 1000) // Last 90 days
                }
              }
            ]
          }
        }
      },
      {
        $addFields: {
          preferenceScore: { $meta: 'vectorSearchScore' },

          // Category affinity scoring ($switch requires at least one branch,
          // so fall back to a constant score when no affinities are recorded)
          categoryScore: (userProfile.categoryAffinities || []).length > 0
            ? {
                $switch: {
                  branches: userProfile.categoryAffinities.map(affinity => ({
                    case: { $eq: ['$category', affinity.category] },
                    then: affinity.score
                  })),
                  default: 0.5
                }
              }
            : 0.5,

          // Author following boost
          authorScore: {
            $cond: {
              if: { $in: ['$author', userProfile.followedAuthors || []] },
              then: 0.8,
              else: 0.4
            }
          },

          // Freshness scoring: document age as a fraction of 30 days
          // ($subtract needs a Date operand, not the number returned by Date.now())
          freshnessScore: {
            $divide: [
              { $subtract: [new Date(), '$publishedDate'] },
              (30 * 24 * 60 * 60 * 1000) // 30 days in milliseconds
            ]
          }
        }
      },
      {
        $addFields: {
          recommendationScore: {
            $add: [
              { $multiply: ['$preferenceScore', 0.4] },
              { $multiply: ['$categoryScore', 0.25] },
              { $multiply: ['$authorScore', 0.2] },
              { $multiply: [{ $max: [0, { $subtract: [1, '$freshnessScore'] }] }, 0.15] }
            ]
          }
        }
      }
    ];

    // Apply diversity to avoid filter bubble
    if (diversityFactor > 0) {
      pipeline.push({
        $group: {
          _id: '$category',
          documents: {
            $push: {
              _id: '$_id',
              title: '$title',
              recommendationScore: '$recommendationScore',
              category: '$category',
              author: '$author',
              publishedDate: '$publishedDate',
              tags: '$tags'
            }
          },
          maxScore: { $max: '$recommendationScore' }
        }
      });

      pipeline.push({
        $sort: { maxScore: -1 }
      });

      // Select diverse recommendations
      pipeline.push({
        $project: {
          documents: {
            $slice: [
              { $sortArray: { input: '$documents', sortBy: { recommendationScore: -1 } } },
              Math.ceil(limit * diversityFactor)
            ]
          }
        }
      });

      pipeline.push({
        $unwind: '$documents'
      });

      pipeline.push({
        $replaceRoot: { newRoot: '$documents' }
      });
    }

    pipeline.push(
      {
        $sort: { recommendationScore: -1 }
      },
      {
        $limit: limit
      }
    );

    const recommendations = await this.collections.documents.aggregate(pipeline).toArray();

    // Generate explanations if requested
    if (includeExplanations) {
      for (const rec of recommendations) {
        rec.explanation = this.generateRecommendationExplanation(rec, userProfile);
      }
    }

    // Store recommendations for future analysis
    await this.collections.recommendations.insertOne({
      userId: userId,
      recommendations: recommendations.map(r => ({
        documentId: r._id,
        score: r.recommendationScore,
        explanation: r.explanation
      })),
      generatedAt: new Date(),
      algorithm: 'vector_preference_matching',
      diversityFactor: diversityFactor
    });

    return {
      userId: userId,
      recommendations: recommendations,
      metadata: {
        totalRecommendations: recommendations.length,
        algorithm: 'vector_preference_matching',
        diversityApplied: diversityFactor > 0,
        generatedAt: new Date()
      }
    };
  }

  // Helper methods for vector search operations

  async generateEmbedding(text) {
    // In production, this would call OpenAI API or other embedding service
    // For this example, we'll simulate embeddings

    // Simulate API call delay
    await new Promise(resolve => setTimeout(resolve, 100));

    // Generate mock embedding vector (in production, use actual embedding API)
    const mockEmbedding = Array.from({ length: this.vectorConfig.dimensions }, () => 
      Math.random() * 2 - 1 // Values between -1 and 1
    );

    return mockEmbedding;
  }

  async extractEntities(text) {
    // Simulate entity extraction (in production, use NLP service)
    const entities = [];

    // Basic keyword extraction simulation
    const words = text.toLowerCase().split(/\W+/);
    const entityKeywords = ['mongodb', 'database', 'javascript', 'python', 'ai', 'machine learning'];

    entityKeywords.forEach(keyword => {
      if (words.includes(keyword) || words.includes(keyword.replace(' ', ''))) {
        entities.push({
          text: keyword,
          type: 'technology',
          confidence: 0.8
        });
      }
    });

    return entities;
  }

  async extractKeyPhrases(text) {
    // Simulate key phrase extraction
    const sentences = text.split(/[.!?]+/);
    const keyPhrases = [];

    sentences.forEach(sentence => {
      const words = sentence.trim().split(/\s+/);
      if (words.length >= 3 && words.length <= 8) {
        keyPhrases.push({
          phrase: sentence.trim(),
          relevance: Math.random()
        });
      }
    });

    return keyPhrases.sort((a, b) => b.relevance - a.relevance).slice(0, 10);
  }

  analyzeContentCharacteristics(content) {
    const wordCount = content.split(/\s+/).length;
    const sentenceCount = content.split(/[.!?]+/).length;
    const avgWordsPerSentence = wordCount / sentenceCount;

    return {
      complexity: avgWordsPerSentence > 20 ? 'high' : avgWordsPerSentence > 15 ? 'medium' : 'low',
      topics: ['general'], // Would use topic modeling in production
      sentiment: Math.random() * 2 - 1, // -1 to 1 scale
      readabilityScore: Math.max(0, Math.min(100, 100 - (avgWordsPerSentence * 2)))
    };
  }

  calculateFreshnessScore(publishedDate) {
    const ageInDays = (Date.now() - publishedDate.getTime()) / (24 * 60 * 60 * 1000);
    return Math.max(0, Math.min(1, 1 - (ageInDays / 365))); // Decay over 1 year
  }

  generateSummary(content) {
    // Simple summary generation (first 200 characters)
    return content.length > 200 ? content.substring(0, 197) + '...' : content;
  }

  buildFilterExpression(filters) {
    const filterExpression = { $and: [] };

    if (filters.category) {
      filterExpression.$and.push({ category: { $eq: filters.category } });
    }

    if (filters.author) {
      filterExpression.$and.push({ author: { $eq: filters.author } });
    }

    if (filters.tags && filters.tags.length > 0) {
      filterExpression.$and.push({ tags: { $in: filters.tags } });
    }

    if (filters.dateRange) {
      filterExpression.$and.push({ 
        publishedDate: {
          $gte: new Date(filters.dateRange.start),
          $lte: new Date(filters.dateRange.end)
        }
      });
    }

    return filterExpression.$and.length > 0 ? filterExpression : {};
  }

  buildSearchFilterClauses(filters) {
    const clauses = [];

    if (filters.category) {
      clauses.push({ equals: { path: 'category', value: filters.category } });
    }

    if (filters.tags && filters.tags.length > 0) {
      clauses.push({ in: { path: 'tags', value: filters.tags } });
    }

    return clauses;
  }

  async logSearchAnalytics(analyticsData) {
    try {
      await this.collections.searchAnalytics.insertOne({
        ...analyticsData,
        sessionId: analyticsData.userContext?.sessionId,
        userId: analyticsData.userContext?.userId
      });
    } catch (error) {
      console.warn('Failed to log search analytics:', error.message);
    }
  }

  async buildUserPreferenceVector(userProfile) {
    if (!userProfile.interactionHistory || userProfile.interactionHistory.length === 0) {
      return null;
    }

    // Get vectors for user's previously interacted documents
    const interactedDocuments = await this.collections.documents.find(
      { 
        _id: { $in: userProfile.interactionHistory.slice(-20).map(h => h.documentId) },
        contentVector: { $exists: true } // skip documents that have not been vectorized yet
      },
      { projection: { contentVector: 1 } }
    ).toArray();

    if (interactedDocuments.length === 0) {
      return null;
    }

    // Calculate weighted average vector based on interaction types
    const weightedVectors = interactedDocuments.map((doc, index) => {
      const interaction = userProfile.interactionHistory.find(h => 
        h.documentId.toString() === doc._id.toString()
      );

      // Fall back to the default weight if no matching interaction is found
      const weight = this.getInteractionWeight(interaction?.type);
      return doc.contentVector.map(val => val * weight);
    });

    // Average the vectors
    const dimensions = weightedVectors[0].length;
    const avgVector = Array(dimensions).fill(0);

    weightedVectors.forEach(vector => {
      vector.forEach((val, i) => {
        avgVector[i] += val;
      });
    });

    return avgVector.map(val => val / weightedVectors.length);
  }

  getInteractionWeight(interactionType) {
    const weights = {
      'view': 0.1,
      'like': 0.3,
      'share': 0.5,
      'bookmark': 0.7,
      'comment': 0.8
    };
    return weights[interactionType] || 0.1;
  }

  generateRecommendationExplanation(recommendation, userProfile) {
    const explanations = [];

    if (userProfile.preferredCategories && userProfile.preferredCategories.includes(recommendation.category)) {
      explanations.push(`Matches your interest in ${recommendation.category}`);
    }

    if (userProfile.followedAuthors && userProfile.followedAuthors.includes(recommendation.author)) {
      explanations.push(`By ${recommendation.author}, an author you follow`);
    }

    if (recommendation.tags) {
      const matchingTags = recommendation.tags.filter(tag => 
        userProfile.interests && userProfile.interests.includes(tag)
      );
      if (matchingTags.length > 0) {
        explanations.push(`Related to ${matchingTags.slice(0, 2).join(' and ')}`);
      }
    }

    if (explanations.length === 0) {
      explanations.push('Similar to content you\'ve previously engaged with');
    }

    return explanations.join('; ');
  }

  async generateGeneralRecommendations(limit) {
    // Fallback recommendations based on popularity and quality
    const pipeline = [
      {
        $addFields: {
          popularityScore: {
            $add: [
              { $multiply: [{ $ifNull: ['$qualitySignals.engagementScore', 0.5] }, 0.4] },
              { $multiply: [{ $ifNull: ['$qualitySignals.authorityScore', 0.5] }, 0.3] },
              { $multiply: [{ $ifNull: ['$qualitySignals.freshnessScore', 0.5] }, 0.3] }
            ]
          }
        }
      },
      {
        $sort: { popularityScore: -1 }
      },
      {
        $limit: limit
      },
      {
        $project: {
          _id: 1,
          title: 1,
          contentSnippet: { $substrCP: ['$content', 0, 200] },
          category: 1,
          author: 1,
          publishedDate: 1,
          popularityScore: 1
        }
      }
    ];

    const recommendations = await this.collections.documents.aggregate(pipeline).toArray();

    return {
      recommendations: recommendations,
      metadata: {
        algorithm: 'popularity_based',
        totalRecommendations: recommendations.length
      }
    };
  }
}

// Benefits of MongoDB Atlas Vector Search:
// - Native vector database capabilities within MongoDB Atlas infrastructure
// - Seamless integration with existing MongoDB documents and operations  
// - Support for multiple vector similarity algorithms (cosine, euclidean, dot product)
// - Hybrid search combining vector similarity with traditional text search
// - Scalable vector indexing with automatic optimization and maintenance
// - Built-in filtering capabilities for combining semantic search with metadata filters
// - Real-time vector search with sub-second response times at scale
// - Integration with popular embedding models (OpenAI, Cohere, Hugging Face)
// - Support for multiple vector dimensions and embedding types
// - Advanced ranking and personalization capabilities for AI-powered applications

module.exports = {
  VectorSearchManager
};

Understanding MongoDB Vector Search Architecture

Advanced Vector Search Patterns and Optimization

Implement sophisticated vector search optimization techniques for production applications:

// Advanced vector search optimization and performance tuning
class VectorSearchOptimizer {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.indexStrategies = {
      exactSearch: { type: 'exactSearch', precision: 1.0, speed: 'slow' },
      approximateSearch: { type: 'approximateSearch', precision: 0.95, speed: 'fast' },
      hierarchicalSearch: { type: 'hierarchicalSearch', precision: 0.98, speed: 'medium' }
    };
  }

  async optimizeVectorIndexConfiguration(collectionName, vectorField, options = {}) {
    console.log(`Optimizing vector index configuration for ${collectionName}.${vectorField}`);

    const {
      dimensions = 1536,
      similarityMetric = 'cosine',
      numCandidates = 1000,
      performanceTarget = 'balanced' // 'speed', 'accuracy', 'balanced'
    } = options;

    // Analyze existing data distribution
    const dataAnalysis = await this.analyzeVectorDataDistribution(collectionName, vectorField);

    // Determine optimal index configuration
    const indexConfig = this.calculateOptimalIndexConfig(
      dataAnalysis, 
      performanceTarget, 
      dimensions
    );

    // Create optimized vector search index configuration
    const optimizedIndex = {
      name: `optimized_${vectorField}_index`,
      definition: {
        fields: [
          {
            type: 'vector',
            path: vectorField,
            numDimensions: dimensions,
            similarity: similarityMetric
          },
          // Add filter fields based on common query patterns
          ...this.generateFilterFieldsFromAnalysis(dataAnalysis)
        ]
      },
      configuration: {
        // Advanced tuning parameters
        numCandidates: this.calculateOptimalCandidates(dataAnalysis.documentCount),
        ef: indexConfig.ef, // Search accuracy parameter
        efConstruction: indexConfig.efConstruction, // Build-time parameter
        maxConnections: indexConfig.maxConnections, // Graph connectivity

        // Performance optimizations
        vectorCompression: indexConfig.compressionEnabled,
        quantization: indexConfig.quantizationLevel,
        cachingStrategy: indexConfig.cachingStrategy
      }
    };

    console.log('Optimized vector index configuration:', optimizedIndex);

    return optimizedIndex;
  }

  async performVectorSearchBenchmark(collectionName, testQueries, indexConfigurations) {
    console.log(`Benchmarking vector search performance with ${testQueries.length} test queries`);

    const benchmarkResults = [];
    const benchmarkStart = Date.now();

    for (const config of indexConfigurations) {
      console.log(`Testing configuration: ${config.name}`);

      const configResults = {
        configurationName: config.name,
        queryResults: [],
        performanceMetrics: {
          avgLatency: 0,
          p95Latency: 0,
          p99Latency: 0,
          throughput: 0,
          accuracy: 0
        }
      };

      const latencies = [];
      const accuracyScores = [];

      const startTime = Date.now();

      for (let i = 0; i < testQueries.length; i++) {
        const query = testQueries[i];

        const queryStart = Date.now();

        try {
          const results = await this.db.collection(collectionName).aggregate([
            {
              $vectorSearch: {
                index: config.indexName,
                path: config.vectorField,
                queryVector: query.vector,
                numCandidates: config.numCandidates || 100,
                limit: query.limit || 10
              }
            },
            {
              $addFields: {
                score: { $meta: 'vectorSearchScore' }
              }
            }
          ]).toArray();

          const queryLatency = Date.now() - queryStart;
          latencies.push(queryLatency);

          // Calculate accuracy if ground truth available
          if (query.expectedResults) {
            const accuracy = this.calculateSearchAccuracy(results, query.expectedResults);
            accuracyScores.push(accuracy);
          }

          configResults.queryResults.push({
            queryIndex: i,
            resultCount: results.length,
            latency: queryLatency,
            topScore: results[0]?.score || 0
          });

        } catch (error) {
          console.error(`Query ${i} failed:`, error.message);
          configResults.queryResults.push({
            queryIndex: i,
            error: error.message,
            latency: null
          });
        }
      }

      const totalTime = Date.now() - startTime;

      // Calculate performance metrics
      const validLatencies = latencies.filter(l => l !== null);
      if (validLatencies.length > 0) {
        configResults.performanceMetrics.avgLatency = 
          validLatencies.reduce((sum, l) => sum + l, 0) / validLatencies.length;

        const sortedLatencies = validLatencies.sort((a, b) => a - b);
        configResults.performanceMetrics.p95Latency = 
          sortedLatencies[Math.floor(sortedLatencies.length * 0.95)];
        configResults.performanceMetrics.p99Latency = 
          sortedLatencies[Math.floor(sortedLatencies.length * 0.99)];

        configResults.performanceMetrics.throughput = 
          (validLatencies.length / totalTime) * 1000; // queries per second
      }

      if (accuracyScores.length > 0) {
        configResults.performanceMetrics.accuracy = 
          accuracyScores.reduce((sum, a) => sum + a, 0) / accuracyScores.length;
      }

      benchmarkResults.push(configResults);
    }

    // Analyze and rank configurations
    const rankedConfigurations = this.rankConfigurationsByPerformance(benchmarkResults);

    return {
      benchmarkResults: benchmarkResults,
      recommendations: rankedConfigurations,
      testMetadata: {
        totalQueries: testQueries.length,
        configurationsTested: indexConfigurations.length,
        benchmarkDuration: Date.now() - benchmarkStart
      }
    };
  }

  async implementAdvancedVectorSearchPatterns(collectionName, searchPattern, options = {}) {
    console.log(`Implementing advanced vector search pattern: ${searchPattern}`);

    const patterns = {
      multiModalSearch: () => this.implementMultiModalSearch(collectionName, options),
      hierarchicalSearch: () => this.implementHierarchicalSearch(collectionName, options),
      temporalVectorSearch: () => this.implementTemporalVectorSearch(collectionName, options),
      facetedVectorSearch: () => this.implementFacetedVectorSearch(collectionName, options),
      clusterBasedSearch: () => this.implementClusterBasedSearch(collectionName, options)
    };

    if (!patterns[searchPattern]) {
      throw new Error(`Unknown search pattern: ${searchPattern}`);
    }

    return await patterns[searchPattern]();
  }

  async implementMultiModalSearch(collectionName, options) {
    // Multi-modal search combining text, image, and other vector embeddings
    const {
      textVector,
      imageVector,
      audioVector,
      weights = { text: 0.5, image: 0.3, audio: 0.2 },
      limit = 20
    } = options;

    const collection = this.db.collection(collectionName);

    // Combine multiple vector searches
    const pipeline = [
      {
        $vectorSearch: {
          index: 'multi_modal_index',
          path: 'textVector',
          queryVector: textVector,
          numCandidates: limit * 5,
          limit: limit * 2
        }
      },
      {
        $addFields: {
          textScore: { $meta: 'vectorSearchScore' }
        }
      }
    ];

    if (imageVector) {
      pipeline.push({
        $unionWith: {
          coll: collectionName,
          pipeline: [
            {
              $vectorSearch: {
                index: 'image_vector_index',
                path: 'imageVector',
                queryVector: imageVector,
                numCandidates: limit * 5,
                limit: limit * 2
              }
            },
            {
              $addFields: {
                imageScore: { $meta: 'vectorSearchScore' }
              }
            }
          ]
        }
      });
    }

    if (audioVector) {
      pipeline.push({
        $unionWith: {
          coll: collectionName,
          pipeline: [
            {
              $vectorSearch: {
                index: 'audio_vector_index', 
                path: 'audioVector',
                queryVector: audioVector,
                numCandidates: limit * 5,
                limit: limit * 2
              }
            },
            {
              $addFields: {
                audioScore: { $meta: 'vectorSearchScore' }
              }
            }
          ]
        }
      });
    }

    // Combine scores from different modalities
    pipeline.push({
      $group: {
        _id: '$_id',
        doc: { $first: '$$ROOT' },
        textScore: { $max: { $ifNull: ['$textScore', 0] } },
        imageScore: { $max: { $ifNull: ['$imageScore', 0] } },
        audioScore: { $max: { $ifNull: ['$audioScore', 0] } }
      }
    });

    pipeline.push({
      $addFields: {
        combinedScore: {
          $add: [
            { $multiply: ['$textScore', weights.text] },
            { $multiply: ['$imageScore', weights.image] },
            { $multiply: ['$audioScore', weights.audio] }
          ]
        }
      }
    });

    pipeline.push({
      $sort: { combinedScore: -1 }
    });

    pipeline.push({
      $limit: limit
    });

    const results = await collection.aggregate(pipeline).toArray();

    return {
      searchType: 'multi_modal',
      results: results,
      weights: weights,
      metadata: {
        modalities: Object.keys(weights).filter(k => options[k + 'Vector']),
        totalResults: results.length
      }
    };
  }

  async implementTemporalVectorSearch(collectionName, options) {
    // Time-aware vector search with temporal relevance
    const {
      queryVector,
      timeWindow = { days: 30 },
      temporalWeight = 0.3,
      limit = 20
    } = options;

    const collection = this.db.collection(collectionName);
    const cutoffDate = new Date(Date.now() - timeWindow.days * 24 * 60 * 60 * 1000);

    const pipeline = [
      {
        $vectorSearch: {
          index: 'temporal_vector_index',
          path: 'contentVector',
          queryVector: queryVector,
          numCandidates: limit * 10,
          limit: limit * 3,
          filter: {
            publishedDate: { $gte: cutoffDate }
          }
        }
      },
      {
        $addFields: {
          vectorScore: { $meta: 'vectorSearchScore' },

          // Calculate temporal relevance
          temporalScore: {
            $divide: [
              { $subtract: ['$publishedDate', cutoffDate] },
              { $subtract: [new Date(), cutoffDate] }
            ]
          }
        }
      },
      {
        $addFields: {
          combinedScore: {
            $add: [
              { $multiply: ['$vectorScore', 1 - temporalWeight] },
              { $multiply: ['$temporalScore', temporalWeight] }
            ]
          }
        }
      },
      {
        $sort: { combinedScore: -1 }
      },
      {
        $limit: limit
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    return {
      searchType: 'temporal_vector',
      results: results,
      temporalWindow: timeWindow,
      temporalWeight: temporalWeight
    };
  }

  // Helper methods for vector search optimization

  async analyzeVectorDataDistribution(collectionName, vectorField) {
    const collection = this.db.collection(collectionName);

    // Sample documents to analyze distribution
    const sampleSize = 1000;
    const pipeline = [
      { $sample: { size: sampleSize } },
      {
        $project: {
          vectorLength: { $size: `$${vectorField}` },
          vectorMagnitude: {
            $sqrt: {
              $reduce: {
                input: `$${vectorField}`,
                initialValue: 0,
                in: { $add: ['$$value', { $multiply: ['$$this', '$$this'] }] }
              }
            }
          }
        }
      }
    ];

    const samples = await collection.aggregate(pipeline).toArray();

    const totalDocs = await collection.countDocuments();
    const avgMagnitude = samples.reduce((sum, doc) => sum + doc.vectorMagnitude, 0) / samples.length;

    return {
      documentCount: totalDocs,
      sampleSize: samples.length,
      avgVectorMagnitude: avgMagnitude,
      vectorDimensions: samples[0]?.vectorLength || 0,
      magnitudeDistribution: this.calculateDistributionStats(
        samples.map(s => s.vectorMagnitude)
      )
    };
  }

  calculateOptimalIndexConfig(dataAnalysis, performanceTarget, dimensions) {
    const baseConfig = {
      ef: 200,
      efConstruction: 400,
      maxConnections: 32,
      compressionEnabled: false,
      quantizationLevel: 'none',
      cachingStrategy: 'adaptive'
    };

    // Adjust based on data characteristics and performance target
    if (dataAnalysis.documentCount > 1000000) {
      baseConfig.compressionEnabled = true;
      baseConfig.quantizationLevel = 'int8';
    }

    switch (performanceTarget) {
      case 'speed':
        baseConfig.ef = 100;
        baseConfig.efConstruction = 200;
        baseConfig.quantizationLevel = 'int8';
        break;
      case 'accuracy':
        baseConfig.ef = 400;
        baseConfig.efConstruction = 800;
        baseConfig.maxConnections = 64;
        break;
      case 'balanced':
      default:
        // Use base configuration
        break;
    }

    return baseConfig;
  }

  generateFilterFieldsFromAnalysis(dataAnalysis) {
    // Generate common filter fields based on data analysis
    return [
      { type: 'filter', path: 'category' },
      { type: 'filter', path: 'publishedDate' },
      { type: 'filter', path: 'tags' }
    ];
  }

  calculateOptimalCandidates(documentCount) {
    // Calculate optimal numCandidates based on collection size
    if (documentCount < 10000) return Math.min(documentCount, 100);
    if (documentCount < 100000) return 200;
    if (documentCount < 1000000) return 500;
    return 1000;
  }

  calculateSearchAccuracy(results, expectedResults) {
    // Calculate precision@k accuracy metric
    const actualIds = new Set(results.map(r => r._id.toString()));
    const expectedIds = new Set(expectedResults.map(r => r._id.toString()));

    let matches = 0;
    for (const id of actualIds) {
      if (expectedIds.has(id)) matches++;
    }

    return matches / Math.min(results.length, expectedResults.length);
  }

  rankConfigurationsByPerformance(benchmarkResults) {
    // Rank configurations based on composite performance score
    return benchmarkResults
      .map(result => ({
        ...result,
        compositeScore: this.calculateCompositeScore(result.performanceMetrics)
      }))
      .sort((a, b) => b.compositeScore - a.compositeScore)
      .map((result, index) => ({
        rank: index + 1,
        configurationName: result.configurationName,
        compositeScore: result.compositeScore,
        metrics: result.performanceMetrics,
        recommendation: this.generateConfigurationRecommendation(result)
      }));
  }

  calculateCompositeScore(metrics) {
    // Weighted composite score combining latency, throughput, and accuracy
    const latencyScore = metrics.avgLatency ? Math.max(0, 1 - (metrics.avgLatency / 1000)) : 0;
    const throughputScore = Math.min(1, metrics.throughput / 100);
    const accuracyScore = metrics.accuracy || 0.8;

    return (latencyScore * 0.4 + throughputScore * 0.3 + accuracyScore * 0.3);
  }

  generateConfigurationRecommendation(result) {
    const metrics = result.performanceMetrics;
    const recommendations = [];

    if (metrics.avgLatency > 500) {
      recommendations.push('Consider reducing numCandidates or enabling quantization for better latency');
    }

    if (metrics.accuracy < 0.8) {
      recommendations.push('Increase ef parameter or numCandidates to improve search accuracy');
    }

    if (metrics.throughput < 10) {
      recommendations.push('Optimize index configuration or consider horizontal scaling');
    }

    return recommendations.length > 0 ? recommendations : ['Configuration performs within acceptable parameters'];
  }

  calculateDistributionStats(values) {
    const sorted = values.slice().sort((a, b) => a - b);
    const mean = values.reduce((sum, val) => sum + val, 0) / values.length;

    return {
      mean: mean,
      median: sorted[Math.floor(sorted.length / 2)],
      min: sorted[0],
      max: sorted[sorted.length - 1],
      stddev: Math.sqrt(values.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) / values.length)
    };
  }
}

SQL-Style Vector Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB vector search operations:

-- QueryLeaf vector search operations with SQL-familiar syntax

-- Create vector search index with SQL DDL
CREATE VECTOR INDEX content_embeddings_idx ON documents (
  content_vector VECTOR(1536) USING cosine_similarity
  WITH (
    num_candidates = 1000,
    index_type = 'hnsw',
    ef_construction = 400,
    max_connections = 32
  )
) 
INCLUDE (category, tags, published_date, author) AS filters;

-- Advanced semantic search with SQL-style vector operations
WITH semantic_query AS (
  -- Generate query embedding (integrated with embedding services)
  SELECT embed_text('machine learning algorithms for natural language processing') as query_vector
),

vector_search_results AS (
  SELECT 
    d.document_id,
    d.title,
    d.content,
    d.category,
    d.tags,
    d.author,
    d.published_date,

    -- Vector similarity search with cosine similarity
    VECTOR_SIMILARITY(d.content_vector, sq.query_vector, 'cosine') as similarity_score,

    -- Vector distance calculations
    VECTOR_DISTANCE(d.content_vector, sq.query_vector, 'euclidean') as euclidean_distance,
    VECTOR_DISTANCE(d.content_vector, sq.query_vector, 'manhattan') as manhattan_distance,

    -- Vector magnitude and normalization
    VECTOR_MAGNITUDE(d.content_vector) as vector_magnitude,
    VECTOR_NORMALIZE(d.content_vector) as normalized_vector

  FROM documents d
  CROSS JOIN semantic_query sq
  WHERE 
    -- Vector similarity threshold filtering
    VECTOR_SIMILARITY(d.content_vector, sq.query_vector, 'cosine') > 0.75

    -- Traditional filters combined with vector search
    AND d.category IN ('AI', 'Technology', 'Data Science')
    AND d.published_date >= CURRENT_DATE - INTERVAL '1 year'

    -- Vector search with K-nearest neighbors
    AND d.document_id IN (
      SELECT document_id 
      FROM VECTOR_KNN_SEARCH(
        table_name => 'documents',
        vector_column => 'content_vector', 
        query_vector => sq.query_vector,
        k => 50,
        distance_function => 'cosine'
      )
    )
),

enhanced_results AS (
  SELECT 
    vsr.*,

    -- Advanced similarity calculations
    VECTOR_DOT_PRODUCT(vsr.normalized_vector, sq.query_vector) as dot_product_similarity,

    -- Multi-vector comparison for hybrid matching
    GREATEST(
      VECTOR_SIMILARITY(d.title_vector, sq.query_vector, 'cosine'),
      vsr.similarity_score * 0.8
    ) as hybrid_similarity_score,

    -- Vector clustering and topic modeling
    VECTOR_CLUSTER_ID(vsr.content_vector, 'kmeans', 10) as topic_cluster,
    VECTOR_TOPIC_PROBABILITY(vsr.content_vector, ARRAY['AI', 'ML', 'NLP', 'Data Science']) as topic_probabilities,

    -- Temporal vector decay for freshness
    vsr.similarity_score * EXP(-0.1 * EXTRACT(DAYS FROM (CURRENT_DATE - vsr.published_date))) as time_decayed_similarity,

    -- Content quality boosting based on vector characteristics
    vsr.similarity_score * (1 + LOG(GREATEST(1, ARRAY_LENGTH(vsr.tags, 1)) / 10.0)) as quality_boosted_similarity,

    -- Personalization using user preference vectors
    COALESCE(
      VECTOR_SIMILARITY(vsr.content_vector, user_preference_vector('user_123'), 'cosine') * 0.3,
      0
    ) as personalization_boost

  FROM vector_search_results vsr
  CROSS JOIN semantic_query sq
  LEFT JOIN documents d ON vsr.document_id = d.document_id
  WHERE vsr.similarity_score > 0.70
),

final_ranked_results AS (
  SELECT 
    document_id,
    title,
    SUBSTRING(content, 1, 300) || '...' as content_preview,
    category,
    tags,
    author,
    published_date,

    -- Comprehensive relevance scoring
    ROUND((
      hybrid_similarity_score * 0.4 +
      time_decayed_similarity * 0.25 +
      quality_boosted_similarity * 0.2 +
      personalization_boost * 0.15
    )::numeric, 4) as final_relevance_score,

    -- Individual score components for analysis
    ROUND(similarity_score::numeric, 4) as base_similarity,
    ROUND(hybrid_similarity_score::numeric, 4) as hybrid_score,
    ROUND(time_decayed_similarity::numeric, 4) as freshness_score,
    ROUND(personalization_boost::numeric, 4) as personal_score,

    -- Vector metadata
    topic_cluster,
    topic_probabilities,
    vector_magnitude,

    -- Search result ranking
    ROW_NUMBER() OVER (ORDER BY final_relevance_score DESC) as search_rank,
    COUNT(*) OVER () as total_results

  FROM enhanced_results
  WHERE (
    hybrid_similarity_score * 0.4 +
    time_decayed_similarity * 0.25 +
    quality_boosted_similarity * 0.2 +
    personalization_boost * 0.15
  ) > 0.6
)

SELECT 
  search_rank,
  document_id,
  title,
  content_preview,
  category,
  STRING_AGG(DISTINCT tag, ', ' ORDER BY tag) as tags_summary,
  author,
  published_date,
  final_relevance_score,

  -- Explanation of ranking factors
  JSON_BUILD_OBJECT(
    'base_similarity', base_similarity,
    'hybrid_boost', hybrid_score - base_similarity,
    'freshness_impact', freshness_score - base_similarity,
    'personalization_impact', personal_score,
    'topic_cluster', topic_cluster,
    'primary_topics', (
      SELECT ARRAY_AGG(topic ORDER BY probability DESC)
      FROM UNNEST(topic_probabilities) WITH ORDINALITY AS t(probability, topic)
      WHERE probability > 0.1
      LIMIT 3
    )
  ) as ranking_explanation

FROM final_ranked_results
CROSS JOIN UNNEST(tags) as tag
GROUP BY search_rank, document_id, title, content_preview, category, author, 
         published_date, final_relevance_score, base_similarity, hybrid_score, 
         freshness_score, personal_score, topic_cluster, topic_probabilities
ORDER BY final_relevance_score DESC
LIMIT 20;

-- Advanced vector aggregation and analytics
WITH vector_analysis AS (
  SELECT 
    category,
    author,
    DATE_TRUNC('month', published_date) as month_bucket,

    -- Vector aggregation functions
    VECTOR_AVG(content_vector) as category_centroid_vector,
    VECTOR_STDDEV(content_vector) as vector_spread,

    -- Vector clustering within groups
    VECTOR_KMEANS_CENTROIDS(content_vector, 5) as sub_clusters,

    -- Similarity analysis within categories
    AVG(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as avg_internal_similarity,
    MIN(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as min_internal_similarity,
    MAX(VECTOR_PAIRWISE_SIMILARITY(content_vector, 'cosine')) as max_internal_similarity,

    -- Document count and metadata
    COUNT(*) as document_count,
    AVG(ARRAY_LENGTH(tags, 1)) as avg_tags_per_doc,
    AVG(LENGTH(content)) as avg_content_length,

    -- Vector quality metrics
    AVG(VECTOR_MAGNITUDE(content_vector)) as avg_vector_magnitude,
    STDDEV(VECTOR_MAGNITUDE(content_vector)) as vector_magnitude_stddev

  FROM documents
  WHERE published_date >= CURRENT_DATE - INTERVAL '2 years'
    AND content_vector IS NOT NULL
  GROUP BY category, author, DATE_TRUNC('month', published_date)
),

cross_category_analysis AS (
  SELECT 
    va1.category as category_a,
    va2.category as category_b,

    -- Cross-category vector similarity
    VECTOR_SIMILARITY(va1.category_centroid_vector, va2.category_centroid_vector, 'cosine') as category_similarity,

    -- Content overlap analysis
    OVERLAP_COEFFICIENT(va1.category, va2.category, 'tags') as tag_overlap,
    OVERLAP_COEFFICIENT(va1.category, va2.category, 'authors') as author_overlap,

    -- Temporal correlation
    CORRELATION(va1.document_count, va2.document_count) OVER (
      PARTITION BY va1.category, va2.category 
      ORDER BY va1.month_bucket
    ) as temporal_correlation

  FROM vector_analysis va1
  CROSS JOIN vector_analysis va2
  WHERE va1.category != va2.category
    AND va1.month_bucket = va2.month_bucket
    AND va1.document_count >= 5
    AND va2.document_count >= 5
),

semantic_recommendations AS (
  SELECT 
    category,

    -- Find most similar categories for recommendation
    ARRAY_AGG(
      category_b ORDER BY category_similarity DESC
    ) FILTER (WHERE category_similarity > 0.7) as similar_categories,

    -- Trending analysis
    CASE 
      WHEN temporal_correlation > 0.8 THEN 'strongly_correlated'
      WHEN temporal_correlation > 0.5 THEN 'moderately_correlated' 
      WHEN temporal_correlation < -0.5 THEN 'inversely_correlated'
      ELSE 'independent'
    END as trend_relationship,

    -- Content strategy recommendations
    CASE
      WHEN AVG(category_similarity) > 0.8 THEN 'High content overlap - consider specialization'
      WHEN AVG(category_similarity) < 0.3 THEN 'Low overlap - good content differentiation'
      ELSE 'Moderate overlap - balanced content strategy'
    END as content_strategy_recommendation

  FROM cross_category_analysis
  GROUP BY category, temporal_correlation
)

SELECT 
  va.category,
  va.document_count,
  ROUND(va.avg_internal_similarity::numeric, 3) as content_consistency_score,
  ROUND(va.avg_vector_magnitude::numeric, 3) as content_richness_score,

  -- Vector-based content insights
  CASE 
    WHEN va.avg_internal_similarity > 0.8 THEN 'Highly consistent content'
    WHEN va.avg_internal_similarity > 0.6 THEN 'Moderately consistent content'
    ELSE 'Diverse content range'
  END as content_consistency_assessment,

  -- Similar categories for cross-promotion
  sr.similar_categories,
  sr.trend_relationship,
  sr.content_strategy_recommendation,

  -- Growth and engagement potential
  CASE
    WHEN va.document_count > LAG(va.document_count) OVER (
      PARTITION BY va.category ORDER BY va.month_bucket
    ) THEN 'Growing'
    WHEN va.document_count < LAG(va.document_count) OVER (
      PARTITION BY va.category ORDER BY va.month_bucket  
    ) THEN 'Declining'
    ELSE 'Stable'
  END as content_trend,

  -- Vector search optimization recommendations
  CASE
    WHEN va.vector_magnitude_stddev > 0.5 THEN 'Consider vector normalization for consistent search performance'
    WHEN va.avg_vector_magnitude < 0.1 THEN 'Low vector magnitudes may indicate embedding quality issues'
    ELSE 'Vector embeddings appear well-distributed'
  END as search_optimization_advice

FROM vector_analysis va
LEFT JOIN semantic_recommendations sr ON va.category = sr.category
WHERE va.month_bucket >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '6 months')
ORDER BY va.document_count DESC, va.avg_internal_similarity DESC;

-- Real-time vector search performance monitoring
WITH search_performance_metrics AS (
  SELECT 
    DATE_TRUNC('hour', search_timestamp) as hour_bucket,
    search_type,

    -- Query performance metrics
    COUNT(*) as total_searches,
    AVG(response_time_ms) as avg_response_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) as p95_response_time,
    MAX(response_time_ms) as max_response_time,

    -- Result quality metrics
    AVG(result_count) as avg_results_returned,
    AVG(CASE WHEN result_count > 0 THEN top_similarity_score ELSE NULL END) as avg_top_similarity,
    AVG(user_satisfaction_score) as avg_user_satisfaction,

    -- Vector search specific metrics
    AVG(vector_candidates_examined) as avg_candidates_examined,
    AVG(vector_index_hit_ratio) as avg_index_hit_ratio,
    COUNT(*) FILTER (WHERE similarity_threshold_met = true) as threshold_met_count,

    -- Error and timeout analysis
    COUNT(*) FILTER (WHERE search_timeout = true) as timeout_count,
    COUNT(*) FILTER (WHERE search_error IS NOT NULL) as error_count,
    STRING_AGG(DISTINCT search_error, '; ') as error_types

  FROM vector_search_log
  WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('hour', search_timestamp), search_type
),

performance_alerts AS (
  SELECT 
    hour_bucket,
    search_type,
    total_searches,
    avg_response_time,
    p95_response_time,
    avg_user_satisfaction,

    -- Performance alerting logic
    CASE 
      WHEN avg_response_time > 1000 THEN 'CRITICAL - High average latency'
      WHEN p95_response_time > 2000 THEN 'WARNING - High P95 latency'
      WHEN avg_user_satisfaction < 0.7 THEN 'WARNING - Low user satisfaction'
      WHEN timeout_count > total_searches * 0.05 THEN 'WARNING - High timeout rate'
      ELSE 'NORMAL'
    END as performance_status,

    -- Optimization recommendations
    CASE
      WHEN avg_candidates_examined > 10000 THEN 'Consider reducing numCandidates for better performance'
      WHEN avg_index_hit_ratio < 0.8 THEN 'Index may need rebuilding - low hit ratio detected'
      WHEN error_count > 0 THEN 'Investigate errors: ' || error_types
      ELSE 'Performance within normal parameters'
    END as optimization_recommendation,

    -- Trending analysis
    avg_response_time - LAG(avg_response_time) OVER (
      PARTITION BY search_type 
      ORDER BY hour_bucket
    ) as latency_trend,

    total_searches - LAG(total_searches) OVER (
      PARTITION BY search_type
      ORDER BY hour_bucket  
    ) as volume_trend

  FROM search_performance_metrics
)

SELECT 
  hour_bucket,
  search_type,
  total_searches,
  ROUND(avg_response_time::numeric, 1) as avg_latency_ms,
  ROUND(p95_response_time::numeric, 1) as p95_latency_ms,
  ROUND(avg_user_satisfaction::numeric, 2) as satisfaction_score,
  performance_status,
  optimization_recommendation,

  -- Trend indicators
  CASE 
    WHEN latency_trend > 200 THEN 'DEGRADING'
    WHEN latency_trend < -200 THEN 'IMPROVING' 
    ELSE 'STABLE'
  END as latency_trend_status,

  CASE
    WHEN volume_trend > total_searches * 0.2 THEN 'HIGH_GROWTH'
    WHEN volume_trend > total_searches * 0.1 THEN 'GROWING'
    WHEN volume_trend < -total_searches * 0.1 THEN 'DECLINING'
    ELSE 'STABLE'
  END as volume_trend_status

FROM performance_alerts
WHERE performance_status != 'NORMAL' OR hour_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours'
ORDER BY hour_bucket DESC, total_searches DESC;

-- QueryLeaf provides comprehensive vector search capabilities:
-- 1. SQL-familiar vector operations with VECTOR_SIMILARITY, VECTOR_DISTANCE functions
-- 2. Advanced K-nearest neighbors search with customizable distance functions
-- 3. Hybrid search combining vector similarity with traditional text search
-- 4. Vector aggregation functions for analytics and clustering
-- 5. Real-time performance monitoring and optimization recommendations
-- 6. Multi-modal vector search across text, image, and audio embeddings
-- 7. Temporal vector search with time-aware relevance scoring
-- 8. Vector-based recommendation systems with personalization
-- 9. Integration with MongoDB's native vector search optimizations
-- 10. Familiar SQL patterns for complex vector analytics and reporting

Best Practices for Vector Search Implementation

Vector Index Design Strategy

Essential principles for optimal MongoDB vector search design (a minimal index-definition sketch follows the list):

  1. Embedding Selection: Choose appropriate embedding models based on content type and use case requirements
  2. Index Configuration: Optimize vector index parameters for the balance of accuracy and performance needed
  3. Filtering Strategy: Design metadata filters to narrow search space before vector similarity calculations
  4. Dimensionality Management: Select optimal embedding dimensions based on content complexity and performance requirements
  5. Update Patterns: Plan for efficient vector updates and re-indexing as content changes
  6. Quality Assurance: Implement vector quality validation and monitoring for embedding consistency
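As a concrete illustration of points 2-4 above, the following sketch defines an Atlas Vector Search index with explicit dimensions, a similarity metric, and filter fields. It assumes MongoDB Atlas, a Node.js driver version that exposes createSearchIndex, and the field names used earlier in this article (contentVector, category, publishedDate, tags); treat it as a starting point rather than a definitive configuration.

// Minimal sketch: defining a vector search index with filter fields
// Assumes MongoDB Atlas and a recent Node.js driver with createSearchIndex support
const { MongoClient } = require('mongodb');

async function createContentVectorIndex(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const collection = client.db('content').collection('documents');

    await collection.createSearchIndex({
      name: 'content_vector_index',
      type: 'vectorSearch',
      definition: {
        fields: [
          {
            type: 'vector',
            path: 'contentVector',   // embedding field queried by $vectorSearch
            numDimensions: 1536,     // must match the embedding model's output size
            similarity: 'cosine'     // 'cosine', 'euclidean', or 'dotProduct'
          },
          // Filter fields narrow the candidate set before similarity scoring
          { type: 'filter', path: 'category' },
          { type: 'filter', path: 'publishedDate' },
          { type: 'filter', path: 'tags' }
        ]
      }
    });
  } finally {
    await client.close();
  }
}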

Performance and Scalability

Optimize MongoDB vector search for production workloads (a small embedding-cache sketch follows the list):

  1. Index Optimization: Monitor and tune vector index parameters based on actual query patterns
  2. Hybrid Search: Combine vector and traditional search for optimal relevance and performance
  3. Caching Strategy: Implement intelligent caching for frequently accessed vectors and query results
  4. Resource Planning: Plan memory and compute resources for vector search operations at scale
  5. Monitoring Setup: Implement comprehensive vector search performance and quality monitoring
  6. Testing Strategy: Develop thorough testing for vector search accuracy and performance characteristics
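One inexpensive way to act on the caching recommendation above is to memoize query embeddings so repeated searches for the same text skip the embedding-service round trip. The sketch below is a minimal in-memory TTL cache; the embedText callback, the one-hour TTL, and the 5,000-entry cap are assumptions rather than part of any MongoDB or QueryLeaf API.

// Minimal sketch: in-memory TTL cache for query embeddings (embedText is an assumed helper)
class EmbeddingCache {
  constructor(embedText, ttlMs = 60 * 60 * 1000, maxEntries = 5000) {
    this.embedText = embedText;   // async (text) => number[], supplied by the caller
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.cache = new Map();       // normalized query text -> { vector, expiresAt }
  }

  async getEmbedding(text) {
    const key = text.trim().toLowerCase();
    const cached = this.cache.get(key);

    if (cached && cached.expiresAt > Date.now()) {
      return cached.vector;                      // cache hit
    }

    const vector = await this.embedText(text);   // cache miss: call the embedding service

    // Evict the oldest entry when full (Map iterates in insertion order)
    if (this.cache.size >= this.maxEntries) {
      this.cache.delete(this.cache.keys().next().value);
    }

    this.cache.set(key, { vector, expiresAt: Date.now() + this.ttlMs });
    return vector;
  }
}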

Conclusion

MongoDB Atlas Vector Search provides native vector database capabilities that eliminate the complexity and infrastructure overhead of separate vector databases while enabling sophisticated semantic search and AI-powered applications. The seamless integration with MongoDB's document model allows developers to combine traditional database operations with advanced vector search in a unified platform.

Key MongoDB Vector Search benefits include:

  • Native Integration: Built-in vector search capabilities within MongoDB Atlas infrastructure
  • Semantic Understanding: Advanced similarity search that understands meaning and context
  • Hybrid Search: Combining vector similarity with traditional text search and metadata filtering
  • Scalable Performance: Production-ready vector indexing with sub-second response times
  • AI-Ready Platform: Direct integration with popular embedding models and AI frameworks
  • Familiar Operations: Vector search operations integrated with standard MongoDB query patterns

Whether you're building recommendation systems, semantic search applications, RAG implementations, or any application requiring intelligent content discovery, MongoDB Atlas Vector Search with QueryLeaf's familiar SQL interface provides the foundation for modern AI-powered applications.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB vector search operations while providing SQL-familiar vector query syntax, similarity functions, and performance optimization. Advanced vector search patterns, multi-modal search, and semantic analytics are seamlessly handled through familiar SQL constructs, making sophisticated AI-powered search both powerful and accessible to SQL-oriented development teams.

The integration of native vector search capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both intelligent semantic search and familiar database interaction patterns, ensuring your AI-powered applications remain both innovative and maintainable as they scale and evolve.

MongoDB Time-Series Collections for IoT and Analytics: High-Performance Data Management with SQL-Style Time-Series Operations

Modern IoT applications, sensor networks, and real-time analytics systems generate massive volumes of time-series data that require specialized storage and query optimization to maintain performance at scale. Traditional relational databases struggle with the high ingestion rates, storage efficiency, and specialized query patterns typical of time-series workloads.

MongoDB Time-Series Collections provide purpose-built optimization for temporal data storage and retrieval, enabling efficient handling of high-frequency sensor data, metrics, logs, and analytics with automatic bucketing, compression, and time-based indexing. Unlike generic document storage that treats all data equally, time-series collections optimize for temporal access patterns, data compression, and analytical aggregations.
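As a preview of the configuration covered later in this article, a time-series collection is declared with a required timeField, an optional metaField for identifying metadata, and a granularity hint that guides bucketing. The sketch below uses the standard createCollection options; the database, collection, and field names are illustrative.

// Minimal sketch: creating a time-series collection (names are illustrative)
const { MongoClient } = require('mongodb');

async function createSensorReadingsCollection(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const db = client.db('iot');

    await db.createCollection('sensor_readings', {
      timeseries: {
        timeField: 'timestamp',   // required: BSON date of each measurement
        metaField: 'sensor',      // optional: rarely-changing identifying fields
        granularity: 'seconds'    // bucketing hint: 'seconds', 'minutes', or 'hours'
      },
      expireAfterSeconds: 60 * 60 * 24 * 90 // optional: automatically drop data older than 90 days
    });
  } finally {
    await client.close();
  }
}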

The Traditional Time-Series Data Challenge

Conventional approaches to managing high-volume time-series data face significant scalability and performance limitations:

-- Traditional relational approach - poor performance with high-volume time-series data

-- PostgreSQL time-series table with performance challenges
CREATE TABLE sensor_readings (
  id BIGSERIAL PRIMARY KEY,
  device_id VARCHAR(50) NOT NULL,
  sensor_type VARCHAR(50) NOT NULL,
  timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
  value NUMERIC(15,6) NOT NULL,
  unit VARCHAR(20),
  location_lat NUMERIC(10,8),
  location_lng NUMERIC(11,8),
  quality_score INTEGER,
  metadata JSONB,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Indexes for time-series queries (heavy overhead)
CREATE INDEX idx_sensor_device_time ON sensor_readings(device_id, timestamp DESC);
CREATE INDEX idx_sensor_type_time ON sensor_readings(sensor_type, timestamp DESC);
CREATE INDEX idx_sensor_time_range ON sensor_readings(timestamp DESC);
CREATE INDEX idx_sensor_location ON sensor_readings USING GIST(location_lat, location_lng);

-- High-frequency data insertion challenges
INSERT INTO sensor_readings (device_id, sensor_type, timestamp, value, unit, location_lat, location_lng, quality_score, metadata)
SELECT 
  'device_' || (i % 1000)::text,
  CASE (i % 5)
    WHEN 0 THEN 'temperature'
    WHEN 1 THEN 'humidity'
    WHEN 2 THEN 'pressure'
    WHEN 3 THEN 'light'
    ELSE 'motion'
  END,
  NOW() - (i || ' seconds')::interval,
  RANDOM() * 100,
  CASE (i % 5)
    WHEN 0 THEN 'celsius'
    WHEN 1 THEN 'percent'
    WHEN 2 THEN 'pascal'
    WHEN 3 THEN 'lux'
    ELSE 'boolean'
  END,
  40.7128 + (RANDOM() - 0.5) * 0.1,
  -74.0060 + (RANDOM() - 0.5) * 0.1,
  (RANDOM() * 100)::integer,
  ('{"source": "sensor_' || (i % 50)::text || '", "batch_id": "' || (i / 1000)::text || '"}')::jsonb
FROM generate_series(1, 1000000) as i;

-- Complex time-series aggregation with performance issues
WITH hourly_aggregates AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    -- Basic aggregations (expensive with large datasets)
    COUNT(*) as reading_count,
    AVG(value) as avg_value,
    MIN(value) as min_value,
    MAX(value) as max_value,
    STDDEV(value) as std_deviation,

    -- Percentile calculations (very expensive)
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value) as p99,

    -- Quality metrics
    AVG(quality_score) as avg_quality,
    COUNT(*) FILTER (WHERE quality_score > 90) as high_quality_readings,

    -- Data completeness analysis
    COUNT(DISTINCT EXTRACT(MINUTE FROM timestamp)) as minutes_with_data,
    (COUNT(DISTINCT EXTRACT(MINUTE FROM timestamp)) / 60.0 * 100) as data_completeness_percent,

    -- Location analysis (expensive with geographic functions)
    AVG(location_lat) as avg_lat,
    AVG(location_lng) as avg_lng,
    ST_ConvexHull(ST_Collect(ST_Point(location_lng, location_lat))) as reading_area

  FROM sensor_readings 
  WHERE timestamp >= NOW() - INTERVAL '7 days'
    AND timestamp < NOW()
    AND quality_score > 50
  GROUP BY device_id, sensor_type, DATE_TRUNC('hour', timestamp)
),

daily_trends AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('day', hour_bucket) as day_bucket,

    -- Daily aggregations from hourly data
    SUM(reading_count) as daily_reading_count,
    AVG(avg_value) as daily_avg_value,
    MIN(min_value) as daily_min_value,
    MAX(max_value) as daily_max_value,

    -- Trend analysis (complex calculations)
    REGR_SLOPE(avg_value, EXTRACT(HOUR FROM hour_bucket)) as hourly_trend_slope,
    REGR_R2(avg_value, EXTRACT(HOUR FROM hour_bucket)) as trend_correlation,

    -- Volatility analysis
    STDDEV(avg_value) as daily_volatility,
    (MAX(avg_value) - MIN(avg_value)) as daily_range,

    -- Peak hour identification
    (array_agg(EXTRACT(HOUR FROM hour_bucket) ORDER BY avg_value DESC))[1] as peak_hour,
    (array_agg(avg_value ORDER BY avg_value DESC))[1] as peak_value,

    -- Data quality metrics
    AVG(avg_quality) as daily_avg_quality,
    AVG(data_completeness_percent) as avg_completeness

  FROM hourly_aggregates
  GROUP BY device_id, sensor_type, DATE_TRUNC('day', hour_bucket)
),

sensor_performance_analysis AS (
  SELECT 
    s.device_id,
    s.sensor_type,

    -- Performance metrics over analysis period
    COUNT(*) as total_readings,
    AVG(s.value) as overall_avg_value,
    STDDEV(s.value) as overall_std_deviation,

    -- Operational metrics
    EXTRACT(EPOCH FROM (MAX(s.timestamp) - MIN(s.timestamp))) / 3600 as hours_active,
    COUNT(*) / NULLIF(EXTRACT(EPOCH FROM (MAX(s.timestamp) - MIN(s.timestamp))) / 3600, 0) as avg_readings_per_hour,

    -- Reliability analysis
    COUNT(*) FILTER (WHERE s.quality_score > 90) / COUNT(*)::float as high_quality_ratio,
    COUNT(*) FILTER (WHERE s.value IS NULL) / COUNT(*)::float as null_value_ratio,

    -- Geographic consistency
    STDDEV(s.location_lat) as lat_consistency,
    STDDEV(s.location_lng) as lng_consistency,

    -- Recent performance vs historical
    AVG(s.value) FILTER (WHERE s.timestamp >= NOW() - INTERVAL '1 day') as recent_avg,
    AVG(s.value) FILTER (WHERE s.timestamp < NOW() - INTERVAL '1 day') as historical_avg,

    -- Anomaly detection (simplified; window functions are not allowed inside an
    -- aggregate FILTER clause, so per-group statistics come from a lateral subquery)
    COUNT(*) FILTER (WHERE ABS(s.value - g.group_avg) > 3 * g.group_stddev) as anomaly_count

  FROM sensor_readings s
  CROSS JOIN LATERAL (
    SELECT AVG(value) AS group_avg, STDDEV(value) AS group_stddev
    FROM sensor_readings
    WHERE device_id = s.device_id AND sensor_type = s.sensor_type AND timestamp >= NOW() - INTERVAL '7 days'
  ) g
  WHERE s.timestamp >= NOW() - INTERVAL '7 days'
  GROUP BY s.device_id, s.sensor_type
)

SELECT 
  spa.device_id,
  spa.sensor_type,
  spa.total_readings,
  ROUND(spa.overall_avg_value::numeric, 3) as avg_value,
  ROUND(spa.overall_std_deviation::numeric, 3) as std_deviation,
  ROUND(spa.hours_active::numeric, 1) as hours_active,
  ROUND(spa.avg_readings_per_hour::numeric, 1) as readings_per_hour,
  ROUND(spa.high_quality_ratio::numeric * 100, 1) as quality_percent,
  spa.anomaly_count,

  -- Daily trend summary
  ROUND(AVG(dt.daily_avg_value)::numeric, 3) as avg_daily_value,
  ROUND(STDDEV(dt.daily_avg_value)::numeric, 3) as daily_volatility,
  ROUND(AVG(dt.hourly_trend_slope)::numeric, 6) as avg_hourly_trend,

  -- Performance assessment
  CASE 
    WHEN spa.high_quality_ratio > 0.95 AND spa.avg_readings_per_hour > 50 THEN 'excellent'
    WHEN spa.high_quality_ratio > 0.90 AND spa.avg_readings_per_hour > 20 THEN 'good'
    WHEN spa.high_quality_ratio > 0.75 AND spa.avg_readings_per_hour > 5 THEN 'acceptable'
    ELSE 'poor'
  END as performance_rating,

  -- Alerting flags
  spa.anomaly_count > spa.total_readings * 0.05 as high_anomaly_rate,
  ABS(spa.recent_avg - spa.historical_avg) > spa.overall_std_deviation * 2 as significant_recent_change,
  spa.avg_readings_per_hour < 1 as low_frequency_readings

FROM sensor_performance_analysis spa
LEFT JOIN daily_trends dt ON spa.device_id = dt.device_id AND spa.sensor_type = dt.sensor_type
GROUP BY spa.device_id, spa.sensor_type, spa.total_readings, spa.overall_avg_value, 
         spa.overall_std_deviation, spa.hours_active, spa.avg_readings_per_hour, 
         spa.high_quality_ratio, spa.anomaly_count, spa.recent_avg, spa.historical_avg
ORDER BY spa.total_readings DESC, spa.avg_readings_per_hour DESC;

-- Problems with traditional time-series approaches:
-- 1. Poor insertion performance due to index maintenance overhead
-- 2. Inefficient storage with high space usage for repetitive time-series data
-- 3. Complex partitioning strategies required for time-based data management
-- 4. Expensive aggregation queries across large time ranges
-- 5. Limited built-in optimization for temporal access patterns
-- 6. Manual compression and archival strategies needed
-- 7. Poor performance with high-cardinality device/sensor combinations
-- 8. Complex schema evolution for changing sensor types and metadata
-- 9. Difficulty with real-time analytics on streaming time-series data
-- 10. Limited support for time-based bucketing and automatic rollups

-- MySQL time-series approach (even more limitations)
CREATE TABLE mysql_sensor_data (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  device_id VARCHAR(50) NOT NULL,
  sensor_type VARCHAR(50) NOT NULL,
  reading_time DATETIME(3) NOT NULL,
  sensor_value DECIMAL(15,6),
  metadata JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  INDEX idx_device_time (device_id, reading_time),
  INDEX idx_sensor_time (sensor_type, reading_time)
) ENGINE=InnoDB;

-- Basic time-series aggregation with MySQL limitations
SELECT 
  device_id,
  sensor_type,
  DATE_FORMAT(reading_time, '%Y-%m-%d %H:00:00') as hour_bucket,
  COUNT(*) as reading_count,
  AVG(sensor_value) as avg_value,
  MIN(sensor_value) as min_value,
  MAX(sensor_value) as max_value,
  STDDEV(sensor_value) as std_deviation
FROM mysql_sensor_data
WHERE reading_time >= DATE_SUB(NOW(), INTERVAL 7 DAY)
GROUP BY device_id, sensor_type, DATE_FORMAT(reading_time, '%Y-%m-%d %H:00:00')
ORDER BY device_id, sensor_type, hour_bucket;

-- MySQL limitations:
-- - Limited JSON support for sensor metadata and flexible schemas
-- - Basic time functions without sophisticated temporal operations
-- - Poor performance with large time-series datasets
-- - No native time-series optimizations or automatic bucketing
-- - Limited aggregation and windowing functions
-- - Simple partitioning options for time-based data
-- - Minimal support for real-time analytics patterns

MongoDB Time-Series Collections provide optimized temporal data management:

// MongoDB Time-Series Collections - optimized for high-performance temporal data
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('iot_platform');

// Advanced time-series data management and analytics platform
class TimeSeriesDataManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
    this.compressionConfig = {
      blockSize: 4096,
      compressionLevel: 9,
      bucketing: 'automatic'
    };
    this.indexingStrategy = {
      timeField: 'timestamp',
      metaField: 'metadata',
      granularity: 'minutes'
    };
  }

  async initializeTimeSeriesCollections() {
    console.log('Initializing optimized time-series collections...');

    // Create time-series collection for sensor data with optimal configuration
    try {
      await this.db.createCollection('sensor_readings', {
        timeseries: {
          timeField: 'timestamp',
          metaField: 'metadata',  // Groups related time-series together
          granularity: 'minutes'  // Optimize for minute-level bucketing
        },
        storageEngine: {
          wiredTiger: {
            configString: 'block_compressor=zstd'  // High compression for time-series data
          }
        }
      });

      console.log('Created time-series collection: sensor_readings');
      this.collections.set('sensor_readings', this.db.collection('sensor_readings'));

    } catch (error) {
      if (error.code !== 48) { // Collection already exists
        throw error;
      }
      console.log('Time-series collection sensor_readings already exists');
      this.collections.set('sensor_readings', this.db.collection('sensor_readings'));
    }

    // Create additional optimized time-series collections for different data types
    const timeSeriesCollections = [
      {
        name: 'device_metrics',
        granularity: 'seconds',  // High-frequency system metrics
        metaField: 'device'
      },
      {
        name: 'environmental_data',
        granularity: 'minutes',  // Environmental sensor data
        metaField: 'location'
      },
      {
        name: 'application_logs',
        granularity: 'seconds',  // Application performance logs
        metaField: 'application'
      },
      {
        name: 'financial_ticks',
        granularity: 'seconds',  // Financial market data
        metaField: 'symbol'
      }
    ];

    for (const config of timeSeriesCollections) {
      try {
        await this.db.createCollection(config.name, {
          timeseries: {
            timeField: 'timestamp',
            metaField: config.metaField,
            granularity: config.granularity
          },
          storageEngine: {
            wiredTiger: {
              configString: 'block_compressor=zstd'
            }
          }
        });

        this.collections.set(config.name, this.db.collection(config.name));
        console.log(`Created time-series collection: ${config.name}`);

      } catch (error) {
        if (error.code !== 48) {
          throw error;
        }
        this.collections.set(config.name, this.db.collection(config.name));
      }
    }

    // Create optimal indexes for time-series queries
    await this.createTimeSeriesIndexes();

    return Array.from(this.collections.keys());
  }

  async createTimeSeriesIndexes() {
    console.log('Creating optimized time-series indexes...');

    const sensorReadings = this.collections.get('sensor_readings');

    // Compound indexes optimized for common time-series query patterns
    const indexSpecs = [
      // Primary access pattern: device + time range
      { 'metadata.deviceId': 1, 'timestamp': 1 },

      // Sensor type + time pattern
      { 'metadata.sensorType': 1, 'timestamp': 1 },

      // Location-based queries with time
      { 'metadata.location': '2dsphere', 'timestamp': 1 },

      // Quality-based filtering with time
      { 'metadata.qualityScore': 1, 'timestamp': 1 },

      // Multi-device aggregation patterns
      { 'metadata.deviceGroup': 1, 'metadata.sensorType': 1, 'timestamp': 1 },

      // Real-time queries (recent data first)
      { 'timestamp': -1 },

      // Data source tracking
      { 'metadata.source': 1, 'timestamp': 1 }
    ];

    for (const indexSpec of indexSpecs) {
      try {
        await sensorReadings.createIndex(indexSpec, {
          background: true,
          partialFilterExpression: { 
            'metadata.qualityScore': { $gt: 0 } // Only index quality data
          }
        });
      } catch (error) {
        console.warn(`Index creation warning for ${JSON.stringify(indexSpec)}:`, error.message);
      }
    }

    console.log('Time-series indexes created successfully');
  }

  async ingestHighFrequencyData(sensorData) {
    console.log(`Ingesting ${sensorData.length} high-frequency sensor readings...`);

    const sensorReadings = this.collections.get('sensor_readings');
    const batchSize = 1000;
    const batches = [];

    // Prepare data with time-series optimized structure
    const optimizedData = sensorData.map(reading => ({
      timestamp: new Date(reading.timestamp),
      value: reading.value,

      // Metadata field for grouping and filtering
      metadata: {
        deviceId: reading.deviceId,
        sensorType: reading.sensorType,
        deviceGroup: reading.deviceGroup || 'default',
        location: {
          type: 'Point',
          coordinates: [reading.longitude, reading.latitude]
        },
        unit: reading.unit,
        qualityScore: reading.qualityScore || 100,
        source: reading.source || 'unknown',
        firmware: reading.firmware,
        calibrationDate: reading.calibrationDate,

        // Additional contextual metadata
        environment: {
          temperature: reading.ambientTemperature,
          humidity: reading.ambientHumidity,
          pressure: reading.ambientPressure
        },

        // Operational metadata
        batteryLevel: reading.batteryLevel,
        signalStrength: reading.signalStrength,
        networkLatency: reading.networkLatency
      },

      // Optional: Additional measurement fields for multi-sensor devices
      ...(reading.additionalMeasurements && {
        measurements: reading.additionalMeasurements
      })
    }));

    // Split into batches for optimal insertion performance
    for (let i = 0; i < optimizedData.length; i += batchSize) {
      batches.push(optimizedData.slice(i, i + batchSize));
    }

    // Insert batches with optimal write concern for time-series data
    let totalInserted = 0;
    const insertionStart = Date.now();

    for (const batch of batches) {
      try {
        const result = await sensorReadings.insertMany(batch, {
          ordered: false,  // Allow partial success for high-throughput ingestion
          writeConcern: { w: 1, j: false }  // Optimize for speed over durability for sensor data
        });

        totalInserted += result.insertedCount;

      } catch (error) {
        console.error('Batch insertion error:', error.message);

        // Handle partial batch failures gracefully
        if (error.result && error.result.insertedCount) {
          totalInserted += error.result.insertedCount;
          console.log(`Partial batch success: ${error.result.insertedCount} documents inserted`);
        }
      }
    }

    const insertionTime = Date.now() - insertionStart;
    const throughput = Math.round(totalInserted / (insertionTime / 1000));

    console.log(`High-frequency ingestion completed: ${totalInserted} documents in ${insertionTime}ms (${throughput} docs/sec)`);

    return {
      totalInserted,
      insertionTime,
      throughput,
      batchCount: batches.length
    };
  }

  async performTimeSeriesAnalytics(deviceId, timeRange, analysisType = 'comprehensive') {
    console.log(`Performing ${analysisType} time-series analytics for device: ${deviceId}`);

    const sensorReadings = this.collections.get('sensor_readings');
    const startTime = new Date(Date.now() - timeRange.hours * 60 * 60 * 1000);
    const endTime = new Date();

    // Comprehensive time-series aggregation pipeline
    const pipeline = [
      // Stage 1: Time range filtering with index utilization
      {
        $match: {
          'metadata.deviceId': deviceId,
          timestamp: {
            $gte: startTime,
            $lte: endTime
          },
          'metadata.qualityScore': { $gt: 50 }  // Filter low-quality readings
        }
      },

      // Stage 2: Add time-based bucketing fields
      {
        $addFields: {
          hourBucket: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'hour'
            }
          },
          minuteBucket: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'minute'
            }
          },
          dayOfWeek: { $dayOfWeek: '$timestamp' },
          hourOfDay: { $hour: '$timestamp' },

          // Minutes elapsed since the start of the analysis window (used for trend regression)
          timeIndex: {
            $divide: [
              { $subtract: ['$timestamp', startTime] },
              1000 * 60  // Convert to minutes
            ]
          }
        }
      },

      // Stage 3: Group by time buckets and sensor type for detailed analytics
      {
        $group: {
          _id: {
            sensorType: '$metadata.sensorType',
            hourBucket: '$hourBucket',
            deviceId: '$metadata.deviceId'
          },

          // Basic statistical measures
          readingCount: { $sum: 1 },
          avgValue: { $avg: '$value' },
          minValue: { $min: '$value' },
          maxValue: { $max: '$value' },
          stdDev: { $stdDevPop: '$value' },

          // Percentile calculations for distribution analysis
          valueArray: { $push: '$value' },

          // Quality metrics
          avgQualityScore: { $avg: '$metadata.qualityScore' },
          highQualityCount: {
            $sum: {
              $cond: [{ $gt: ['$metadata.qualityScore', 90] }, 1, 0]
            }
          },

          // Operational metrics
          avgBatteryLevel: { $avg: '$metadata.batteryLevel' },
          avgSignalStrength: { $avg: '$metadata.signalStrength' },
          avgNetworkLatency: { $avg: '$metadata.networkLatency' },

          // Environmental context
          avgAmbientTemp: { $avg: '$metadata.environment.temperature' },
          avgAmbientHumidity: { $avg: '$metadata.environment.humidity' },
          avgAmbientPressure: { $avg: '$metadata.environment.pressure' },

          // Time distribution analysis
          firstReading: { $min: '$timestamp' },
          lastReading: { $max: '$timestamp' },
          timeSpread: { $stdDevPop: '$timeIndex' },

          // Data completeness tracking
          uniqueMinutes: { $addToSet: '$minuteBucket' },

          // Trend analysis preparation
          timeValuePairs: {
            $push: {
              time: '$timeIndex',
              value: '$value'
            }
          }
        }
      },

      // Stage 4: Calculate advanced analytics and derived metrics
      {
        $addFields: {
          // Statistical analysis
          valueRange: { $subtract: ['$maxValue', '$minValue'] },
          coefficientOfVariation: {
            $cond: {
              if: { $gt: ['$avgValue', 0] },
              then: { $divide: ['$stdDev', '$avgValue'] },
              else: 0
            }
          },

          // Percentile calculations (values must be sorted first; $sortArray requires MongoDB 5.2+)
          median: {
            $arrayElemAt: [
              { $sortArray: { input: '$valueArray', sortBy: 1 } },
              { $floor: { $multiply: [{ $size: '$valueArray' }, 0.5] } }
            ]
          },
          p95: {
            $arrayElemAt: [
              { $sortArray: { input: '$valueArray', sortBy: 1 } },
              { $floor: { $multiply: [{ $size: '$valueArray' }, 0.95] } }
            ]
          },
          p99: {
            $arrayElemAt: [
              { $sortArray: { input: '$valueArray', sortBy: 1 } },
              { $floor: { $multiply: [{ $size: '$valueArray' }, 0.99] } }
            ]
          },

          // Data quality assessment
          qualityRatio: {
            $divide: ['$highQualityCount', '$readingCount']
          },

          // Data completeness calculation
          dataCompleteness: {
            $divide: [
              { $size: '$uniqueMinutes' },
              {
                $divide: [
                  { $subtract: ['$lastReading', '$firstReading'] },
                  60000  // Minutes in milliseconds
                ]
              }
            ]
          },

          // Operational health scoring
          operationalScore: {
            $multiply: [
              { $ifNull: ['$avgBatteryLevel', 100] },
              { $divide: [{ $ifNull: ['$avgSignalStrength', 100] }, 100] },
              {
                $cond: {
                  if: { $gt: [{ $ifNull: ['$avgNetworkLatency', 0] }, 0] },
                  then: { $divide: [1000, { $add: ['$avgNetworkLatency', 1000] }] },
                  else: 1
                }
              }
            ]
          },

          // Trend analysis using linear regression
          trendSlope: {
            $let: {
              vars: {
                n: { $size: '$timeValuePairs' },
                sumX: {
                  $reduce: {
                    input: '$timeValuePairs',
                    initialValue: 0,
                    in: { $add: ['$$value', '$$this.time'] }
                  }
                },
                sumY: {
                  $reduce: {
                    input: '$timeValuePairs',
                    initialValue: 0,
                    in: { $add: ['$$value', '$$this.value'] }
                  }
                },
                sumXY: {
                  $reduce: {
                    input: '$timeValuePairs',
                    initialValue: 0,
                    in: { $add: ['$$value', { $multiply: ['$$this.time', '$$this.value'] }] }
                  }
                },
                sumX2: {
                  $reduce: {
                    input: '$timeValuePairs',
                    initialValue: 0,
                    in: { $add: ['$$value', { $multiply: ['$$this.time', '$$this.time'] }] }
                  }
                }
              },
              in: {
                $cond: {
                  if: {
                    $gt: [
                      { $subtract: [{ $multiply: ['$$n', '$$sumX2'] }, { $multiply: ['$$sumX', '$$sumX'] }] },
                      0
                    ]
                  },
                  then: {
                    $divide: [
                      { $subtract: [{ $multiply: ['$$n', '$$sumXY'] }, { $multiply: ['$$sumX', '$$sumY'] }] },
                      { $subtract: [{ $multiply: ['$$n', '$$sumX2'] }, { $multiply: ['$$sumX', '$$sumX'] }] }
                    ]
                  },
                  else: 0
                }
              }
            }
          }
        }
      },

      // Stage 5: Anomaly detection and alerting
      {
        $addFields: {
          // Anomaly flags based on statistical analysis
          hasHighVariance: { $gt: ['$coefficientOfVariation', 0.5] },
          hasDataGaps: { $lt: ['$dataCompleteness', 0.85] },
          hasLowQuality: { $lt: ['$qualityRatio', 0.9] },
          hasOperationalIssues: { $lt: ['$operationalScore', 50] },
          hasSignificantTrend: { $gt: [{ $abs: '$trendSlope' }, 0.1] },

          // Performance classification
          performanceCategory: {
            $switch: {
              branches: [
                {
                  case: {
                    $and: [
                      { $gt: ['$qualityRatio', 0.95] },
                      { $gt: ['$dataCompleteness', 0.95] },
                      { $gt: ['$operationalScore', 80] }
                    ]
                  },
                  then: 'excellent'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$qualityRatio', 0.90] },
                      { $gt: ['$dataCompleteness', 0.90] },
                      { $gt: ['$operationalScore', 60] }
                    ]
                  },
                  then: 'good'
                },
                {
                  case: {
                    $and: [
                      { $gt: ['$qualityRatio', 0.75] },
                      { $gt: ['$dataCompleteness', 0.75] }
                    ]
                  },
                  then: 'acceptable'
                }
              ],
              default: 'poor'
            }
          },

          // Alert priority calculation
          alertPriority: {
            $cond: {
              if: {
                $or: [
                  { $lt: ['$operationalScore', 25] },
                  { $lt: ['$dataCompleteness', 0.5] },
                  { $gt: [{ $abs: '$trendSlope' }, 1.0] }
                ]
              },
              then: 'critical',
              else: {
                $cond: {
                  if: {
                    $or: [
                      { $lt: ['$operationalScore', 50] },
                      { $lt: ['$qualityRatio', 0.8] },
                      { $gt: ['$coefficientOfVariation', 0.8] }
                    ]
                  },
                  then: 'warning',
                  else: 'normal'
                }
              }
            }
          }
        }
      },

      // Stage 6: Final projection with comprehensive metrics
      {
        $project: {
          _id: 1,
          deviceId: '$_id.deviceId',
          sensorType: '$_id.sensorType',
          hourBucket: '$_id.hourBucket',

          // Core statistics
          readingCount: 1,
          avgValue: { $round: ['$avgValue', 3] },
          minValue: { $round: ['$minValue', 3] },
          maxValue: { $round: ['$maxValue', 3] },
          stdDev: { $round: ['$stdDev', 3] },
          valueRange: { $round: ['$valueRange', 3] },
          coefficientOfVariation: { $round: ['$coefficientOfVariation', 3] },

          // Distribution metrics
          median: { $round: ['$median', 3] },
          p95: { $round: ['$p95', 3] },
          p99: { $round: ['$p99', 3] },

          // Quality and completeness
          qualityRatio: { $round: ['$qualityRatio', 3] },
          dataCompleteness: { $round: ['$dataCompleteness', 3] },

          // Operational metrics
          operationalScore: { $round: ['$operationalScore', 1] },
          avgBatteryLevel: { $round: ['$avgBatteryLevel', 1] },
          avgSignalStrength: { $round: ['$avgSignalStrength', 1] },
          avgNetworkLatency: { $round: ['$avgNetworkLatency', 1] },

          // Environmental context
          avgAmbientTemp: { $round: ['$avgAmbientTemp', 2] },
          avgAmbientHumidity: { $round: ['$avgAmbientHumidity', 2] },
          avgAmbientPressure: { $round: ['$avgAmbientPressure', 2] },

          // Trend analysis
          trendSlope: { $round: ['$trendSlope', 6] },
          timeSpread: { $round: ['$timeSpread', 2] },

          // Time range
          firstReading: 1,
          lastReading: 1,
          analysisHours: {
            $round: [
              { $divide: [{ $subtract: ['$lastReading', '$firstReading'] }, 3600000] },
              2
            ]
          },

          // Classification and alerts
          performanceCategory: 1,
          alertPriority: 1,

          // Anomaly flags
          anomalies: {
            highVariance: '$hasHighVariance',
            dataGaps: '$hasDataGaps',
            lowQuality: '$hasLowQuality',
            operationalIssues: '$hasOperationalIssues',
            significantTrend: '$hasSignificantTrend'
          }
        }
      },

      // Stage 7: Sort by time bucket for temporal analysis
      {
        $sort: {
          sensorType: 1,
          hourBucket: 1
        }
      }
    ];

    // Execute comprehensive time-series analytics
    const analyticsStart = Date.now();
    const results = await sensorReadings.aggregate(pipeline, {
      allowDiskUse: true,
      hint: { 'metadata.deviceId': 1, 'timestamp': 1 }
    }).toArray();

    const analyticsTime = Date.now() - analyticsStart;

    console.log(`Time-series analytics completed in ${analyticsTime}ms for ${results.length} time buckets`);

    // Generate summary insights
    const insights = this.generateAnalyticsInsights(results, timeRange);

    return {
      deviceId: deviceId,
      analysisType: analysisType,
      timeRange: {
        start: startTime,
        end: endTime,
        hours: timeRange.hours
      },
      executionTime: analyticsTime,
      bucketCount: results.length,
      hourlyData: results,
      insights: insights
    };
  }

  generateAnalyticsInsights(analyticsResults, timeRange) {
    const insights = {
      summary: {},
      trends: {},
      quality: {},
      alerts: [],
      recommendations: []
    };

    if (analyticsResults.length === 0) {
      insights.alerts.push({
        type: 'no_data',
        severity: 'critical',
        message: 'No sensor data found for the specified time range and quality criteria'
      });
      return insights;
    }

    // Summary statistics
    const totalReadings = analyticsResults.reduce((sum, r) => sum + r.readingCount, 0);
    const avgQuality = analyticsResults.reduce((sum, r) => sum + r.qualityRatio, 0) / analyticsResults.length;
    const avgCompleteness = analyticsResults.reduce((sum, r) => sum + r.dataCompleteness, 0) / analyticsResults.length;
    const avgOperationalScore = analyticsResults.reduce((sum, r) => sum + r.operationalScore, 0) / analyticsResults.length;

    insights.summary = {
      totalReadings: totalReadings,
      avgReadingsPerHour: Math.round(totalReadings / timeRange.hours),
      avgQualityRatio: Math.round(avgQuality * 100) / 100,
      avgDataCompleteness: Math.round(avgCompleteness * 100) / 100,
      avgOperationalScore: Math.round(avgOperationalScore * 100) / 100,
      sensorTypes: [...new Set(analyticsResults.map(r => r.sensorType))],
      performanceDistribution: {
        excellent: analyticsResults.filter(r => r.performanceCategory === 'excellent').length,
        good: analyticsResults.filter(r => r.performanceCategory === 'good').length,
        acceptable: analyticsResults.filter(r => r.performanceCategory === 'acceptable').length,
        poor: analyticsResults.filter(r => r.performanceCategory === 'poor').length
      }
    };

    // Trend analysis
    const trendingUp = analyticsResults.filter(r => r.trendSlope > 0.05).length;
    const trendingDown = analyticsResults.filter(r => r.trendSlope < -0.05).length;
    const stable = analyticsResults.length - trendingUp - trendingDown;

    insights.trends = {
      trendingUp: trendingUp,
      trendingDown: trendingDown,
      stable: stable,
      strongestUpTrend: Math.max(...analyticsResults.map(r => r.trendSlope)),
      strongestDownTrend: Math.min(...analyticsResults.map(r => r.trendSlope)),
      mostVolatile: Math.max(...analyticsResults.map(r => r.coefficientOfVariation))
    };

    // Quality analysis
    const lowQualityBuckets = analyticsResults.filter(r => r.qualityRatio < 0.8);
    const dataGapBuckets = analyticsResults.filter(r => r.dataCompleteness < 0.8);

    insights.quality = {
      lowQualityBuckets: lowQualityBuckets.length,
      dataGapBuckets: dataGapBuckets.length,
      worstQuality: Math.min(...analyticsResults.map(r => r.qualityRatio)),
      bestQuality: Math.max(...analyticsResults.map(r => r.qualityRatio)),
      worstCompleteness: Math.min(...analyticsResults.map(r => r.dataCompleteness)),
      bestCompleteness: Math.max(...analyticsResults.map(r => r.dataCompleteness))
    };

    // Generate alerts based on analysis
    const criticalAlerts = analyticsResults.filter(r => r.alertPriority === 'critical');
    const warningAlerts = analyticsResults.filter(r => r.alertPriority === 'warning');

    criticalAlerts.forEach(result => {
      insights.alerts.push({
        type: 'critical_performance',
        severity: 'critical',
        sensorType: result.sensorType,
        hourBucket: result.hourBucket,
        message: `Critical performance issues detected: ${result.performanceCategory} performance with operational score ${result.operationalScore}`
      });
    });

    warningAlerts.forEach(result => {
      insights.alerts.push({
        type: 'performance_warning',
        severity: 'warning',
        sensorType: result.sensorType,
        hourBucket: result.hourBucket,
        message: `Performance warning: ${result.performanceCategory} performance with quality ratio ${result.qualityRatio}`
      });
    });

    // Generate recommendations
    if (avgQuality < 0.9) {
      insights.recommendations.push('Consider sensor calibration or replacement due to low quality scores');
    }

    if (avgCompleteness < 0.85) {
      insights.recommendations.push('Investigate data transmission issues causing data gaps');
    }

    if (avgOperationalScore < 60) {
      insights.recommendations.push('Review device operational status - low battery or connectivity issues detected');
    }

    if (insights.trends.trendingDown > insights.trends.trendingUp * 2) {
      insights.recommendations.push('Multiple sensors showing downward trends - investigate environmental factors');
    }

    return insights;
  }

  async performRealTimeAggregation(collectionName, windowSize = '5m') {
    console.log(`Performing real-time aggregation with ${windowSize} window...`);

    const collection = this.collections.get(collectionName);
    const windowMs = this.parseTimeWindow(windowSize);
    const currentTime = new Date();
    const windowStart = new Date(currentTime.getTime() - windowMs);

    const pipeline = [
      // Match recent data within the time window
      {
        $match: {
          timestamp: { $gte: windowStart, $lte: currentTime }
        }
      },

      // Add time bucketing for sub-window analysis
      {
        $addFields: {
          timeBucket: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'minute'
            }
          }
        }
      },

      // Group by metadata and time bucket
      {
        $group: {
          _id: {
            metaKey: '$metadata',
            timeBucket: '$timeBucket'
          },
          count: { $sum: 1 },
          avgValue: { $avg: '$value' },
          minValue: { $min: '$value' },
          maxValue: { $max: '$value' },
          latestReading: { $max: '$timestamp' },
          values: { $push: '$value' }
        }
      },

      // Calculate real-time statistics
      {
        $addFields: {
          stdDev: { $stdDevPop: '$values' },
          variance: { $pow: [{ $stdDevPop: '$values' }, 2] },
          range: { $subtract: ['$maxValue', '$minValue'] },

          // Real-time anomaly detection
          isAnomalous: {
            $let: {
              vars: {
                mean: '$avgValue',
                std: { $stdDevPop: '$values' }
              },
              in: {
                $gt: [
                  {
                    $size: {
                      $filter: {
                        input: '$values',
                        cond: {
                          $gt: [
                            { $abs: { $subtract: ['$$this', '$$mean'] } },
                            { $multiply: ['$$std', 2] }
                          ]
                        }
                      }
                    }
                  },
                  { $multiply: [{ $size: '$values' }, 0.05] }  // More than 5% outliers
                ]
              }
            }
          }
        }
      },

      // Sort by latest readings first
      {
        $sort: { 'latestReading': -1 }
      },

      // Limit to prevent overwhelming results
      {
        $limit: 100
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    return {
      windowSize: windowSize,
      windowStart: windowStart,
      windowEnd: currentTime,
      aggregations: results,
      totalBuckets: results.length
    };
  }

  parseTimeWindow(windowString) {
    const match = windowString.match(/^(\d+)([smhd])$/);
    if (!match) return 5 * 60 * 1000; // Default 5 minutes

    const value = parseInt(match[1]);
    const unit = match[2];

    const multipliers = {
      's': 1000,
      'm': 60 * 1000,
      'h': 60 * 60 * 1000,
      'd': 24 * 60 * 60 * 1000
    };

    return value * multipliers[unit];
  }

  async optimizeTimeSeriesPerformance() {
    console.log('Optimizing time-series collection performance...');

    const optimizations = [];

    for (const [collectionName, collection] of this.collections) {
      console.log(`Optimizing collection: ${collectionName}`);

      // Get collection statistics
      // The Node.js driver exposes database commands through db.command()
      const stats = await this.db.command({ collStats: collectionName });

      // Check for optimal bucketing configuration
      if (stats.timeseries) {
        const bucketInfo = {
          granularity: stats.timeseries.granularity,
          bucketCount: stats.timeseries.numBuckets,
          avgBucketSize: stats.size / (stats.timeseries.numBuckets || 1),
          compressionRatio: stats.timeseries.compressionRatio || 'N/A'
        };

        optimizations.push({
          collection: collectionName,
          type: 'bucketing_analysis',
          current: bucketInfo,
          recommendations: this.generateBucketingRecommendations(bucketInfo)
        });
      }

      // Analyze index usage
      const indexStats = await collection.aggregate([{ $indexStats: {} }]).toArray();
      const indexRecommendations = this.analyzeIndexUsage(indexStats);

      optimizations.push({
        collection: collectionName,
        type: 'index_analysis',
        indexes: indexStats,
        recommendations: indexRecommendations
      });

      // Check for data retention optimization opportunities
      const oldestDocument = await collection.findOne({}, { sort: { timestamp: 1 } });
      const newestDocument = await collection.findOne({}, { sort: { timestamp: -1 } });

      if (oldestDocument && newestDocument) {
        const dataSpan = newestDocument.timestamp - oldestDocument.timestamp;
        const dataSpanDays = dataSpan / (1000 * 60 * 60 * 24);

        optimizations.push({
          collection: collectionName,
          type: 'retention_analysis',
          dataSpanDays: Math.round(dataSpanDays),
          oldestDocument: oldestDocument.timestamp,
          newestDocument: newestDocument.timestamp,
          recommendations: dataSpanDays > 365 ? 
            ['Consider implementing data archival strategy for data older than 1 year'] : []
        });
      }
    }

    return optimizations;
  }

  generateBucketingRecommendations(bucketInfo) {
    const recommendations = [];

    if (bucketInfo.avgBucketSize > 10 * 1024 * 1024) { // 10MB
      recommendations.push('Consider reducing granularity - buckets are very large');
    }

    if (bucketInfo.avgBucketSize < 64 * 1024) { // 64KB
      recommendations.push('Consider increasing granularity - buckets are too small for optimal compression');
    }

    if (bucketInfo.bucketCount > 1000000) {
      recommendations.push('High bucket count may impact query performance - review time-series collection design');
    }

    return recommendations;
  }

  analyzeIndexUsage(indexStats) {
    const recommendations = [];
    const lowUsageThreshold = 100;

    indexStats.forEach(stat => {
      if (stat.accesses && stat.accesses.ops < lowUsageThreshold) {
        recommendations.push(`Consider dropping low-usage index: ${stat.name} (${stat.accesses.ops} operations)`);
      }
    });

    return recommendations;
  }
}

// Benefits of MongoDB Time-Series Collections:
// - Automatic data bucketing and compression optimized for temporal data patterns
// - Built-in indexing strategies designed for time-range and metadata queries
// - Up to 90% storage space reduction compared to regular collections
// - Optimized aggregation pipelines with time-aware query planning
// - Native support for high-frequency data ingestion with minimal overhead
// - Automatic handling of out-of-order insertions common in IoT scenarios
// - Integration with MongoDB's change streams for real-time analytics
// - Support for complex metadata structures while maintaining query performance
// - Time-aware sharding strategies for horizontal scaling
// - Native compatibility with BI and analytics tools through standard MongoDB interfaces

module.exports = {
  TimeSeriesDataManager
};
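
A usage sketch showing how the manager above might be wired together end-to-end. The connection string and synthetic readings are illustrative and assume a MongoDB 5.0+ deployment reachable locally.

// Illustrative wiring for TimeSeriesDataManager (values are synthetic examples)
async function runTimeSeriesExample() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const manager = new TimeSeriesDataManager(client.db('iot_platform'));
  await manager.initializeTimeSeriesCollections();

  // Build a small synthetic batch in the shape ingestHighFrequencyData() expects
  const readings = Array.from({ length: 500 }, (_, i) => ({
    timestamp: Date.now() - i * 1000,
    value: 20 + Math.random() * 5,
    deviceId: 'device_001',
    sensorType: 'temperature',
    unit: 'celsius',
    latitude: 40.7128,
    longitude: -74.0060,
    qualityScore: 95,
    source: 'example'
  }));
  console.log('Ingestion summary:', await manager.ingestHighFrequencyData(readings));

  // Run the hourly analytics pipeline over the last 24 hours for the same device
  const analytics = await manager.performTimeSeriesAnalytics('device_001', { hours: 24 });
  console.log('Insights:', analytics.insights.summary);

  await client.close();
}

runTimeSeriesExample().catch(console.error);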

Understanding MongoDB Time-Series Collection Architecture
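
Under the hood, a time-series collection is a writable view over an internal bucket collection named system.buckets.<collection>. Measurements that share the same metaField value and fall within the same time window are packed into a single bucket document that stores the individual readings column-wise and keeps per-field minimum and maximum summaries in a control section, which is what enables the heavy compression and fast time-range pruning used above. The exact bucket format is an internal detail that changes between server versions, so the following exploratory sketch (collection name taken from the earlier examples) should only ever be used for inspection, never for writes:

// Exploratory sketch: inspect the internal bucket storage behind a time-series collection.
// The bucket document shape is version-dependent - do not write to system.buckets directly.
async function inspectBuckets(db) {
  // The user-facing collection is a view; raw buckets live in system.buckets.<name>
  const bucket = await db.collection('system.buckets.sensor_readings').findOne();

  if (bucket) {
    console.log('Bucket meta (shared metaField value):', bucket.meta);
    console.log('Bucket control summary (per-field min/max):', bucket.control);
    // Individual measurements are stored column-wise under bucket.data
  }

  // Confirm the collection type and configured time-series options via listCollections
  const [info] = await db.listCollections({ name: 'sensor_readings' }).toArray();
  if (info) {
    console.log('Collection type:', info.type, 'options:', info.options);
  }
}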

Advanced Time-Series Optimization Strategies

Implement sophisticated time-series patterns for maximum performance and storage efficiency:

// Advanced time-series optimization and real-time analytics patterns
class TimeSeriesOptimizer {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.compressionStrategies = {
      zstd: { level: 9, ratio: 0.85 },
      snappy: { level: 1, ratio: 0.75 },
      lz4: { level: 1, ratio: 0.70 }
    };
  }

  async optimizeIngestionPipeline(deviceTypes) {
    console.log('Optimizing time-series ingestion pipeline for device types:', deviceTypes);

    const optimizations = {};

    for (const deviceType of deviceTypes) {
      // Analyze ingestion patterns for each device type
      const ingestionAnalysis = await this.analyzeIngestionPatterns(deviceType);

      // Determine optimal collection configuration
      const optimalConfig = this.calculateOptimalConfiguration(ingestionAnalysis);

      // Create optimized collection if needed
      const collectionName = `ts_${deviceType.toLowerCase().replace(/[^a-z0-9]/g, '_')}`;

      try {
        await this.db.createCollection(collectionName, {
          timeseries: {
            timeField: 'timestamp',
            metaField: 'device',
            granularity: optimalConfig.granularity
          },
          storageEngine: {
            wiredTiger: {
              configString: `block_compressor=${optimalConfig.compression}`
            }
          }
        });

        // Create optimal indexes for the device type
        await this.createOptimalIndexes(collectionName, ingestionAnalysis.queryPatterns);

        optimizations[deviceType] = {
          collection: collectionName,
          configuration: optimalConfig,
          expectedPerformance: {
            ingestionRate: optimalConfig.estimatedIngestionRate,
            compressionRatio: optimalConfig.estimatedCompressionRatio,
            queryPerformance: optimalConfig.estimatedQueryPerformance
          }
        };

      } catch (error) {
        console.warn(`Collection ${collectionName} already exists or creation failed:`, error.message);
      }
    }

    return optimizations;
  }

  async analyzeIngestionPatterns(deviceType) {
    // Simulate analysis of historical ingestion patterns
    const patterns = {
      temperature: {
        avgFrequency: 60, // seconds
        avgBatchSize: 1,
        dataVariability: 0.2,
        queryPatterns: ['recent_values', 'hourly_aggregates', 'anomaly_detection']
      },
      pressure: {
        avgFrequency: 30,
        avgBatchSize: 1,
        dataVariability: 0.1,
        queryPatterns: ['trend_analysis', 'threshold_monitoring']
      },
      vibration: {
        avgFrequency: 1, // High frequency
        avgBatchSize: 100,
        dataVariability: 0.8,
        queryPatterns: ['fft_analysis', 'peak_detection', 'real_time_monitoring']
      },
      gps: {
        avgFrequency: 10,
        avgBatchSize: 1,
        dataVariability: 0.5,
        queryPatterns: ['geospatial_queries', 'route_analysis', 'location_history']
      }
    };

    return patterns[deviceType] || patterns.temperature;
  }

  calculateOptimalConfiguration(ingestionAnalysis) {
    const { avgFrequency, avgBatchSize, dataVariability, queryPatterns } = ingestionAnalysis;

    // Determine optimal granularity based on frequency
    let granularity;
    if (avgFrequency <= 1) {
      granularity = 'seconds';
    } else if (avgFrequency <= 60) {
      granularity = 'minutes';
    } else {
      granularity = 'hours';
    }

    // Choose compression strategy based on data characteristics
    let compression;
    if (dataVariability < 0.3) {
      compression = 'zstd'; // High compression for low variability data
    } else if (dataVariability < 0.6) {
      compression = 'snappy'; // Balanced compression/speed
    } else {
      compression = 'lz4'; // Fast compression for high variability
    }

    // Estimate performance characteristics
    const estimatedIngestionRate = Math.floor((3600 / avgFrequency) * avgBatchSize);
    const compressionStrategy = this.compressionStrategies[compression];

    return {
      granularity,
      compression,
      estimatedIngestionRate,
      estimatedCompressionRatio: compressionStrategy.ratio,
      estimatedQueryPerformance: this.estimateQueryPerformance(queryPatterns, granularity),
      recommendedIndexes: this.recommendIndexes(queryPatterns)
    };
  }

  estimateQueryPerformance(queryPatterns, granularity) {
    const performanceScores = {
      recent_values: granularity === 'seconds' ? 95 : granularity === 'minutes' ? 90 : 80,
      hourly_aggregates: granularity === 'minutes' ? 95 : granularity === 'hours' ? 100 : 85,
      trend_analysis: granularity === 'minutes' ? 90 : granularity === 'hours' ? 95 : 75,
      anomaly_detection: granularity === 'seconds' ? 85 : granularity === 'minutes' ? 95 : 70,
      geospatial_queries: 85,
      real_time_monitoring: granularity === 'seconds' ? 100 : granularity === 'minutes' ? 80 : 60
    };

    const avgScore = queryPatterns.reduce((sum, pattern) => 
      sum + (performanceScores[pattern] || 75), 0) / queryPatterns.length;

    return Math.round(avgScore);
  }

  recommendIndexes(queryPatterns) {
    const indexRecommendations = {
      recent_values: [{ timestamp: -1 }],
      hourly_aggregates: [{ 'device.deviceId': 1, timestamp: 1 }],
      trend_analysis: [{ 'device.sensorType': 1, timestamp: 1 }],
      anomaly_detection: [{ 'device.deviceId': 1, 'device.sensorType': 1, timestamp: 1 }],
      geospatial_queries: [{ 'device.location': '2dsphere', timestamp: 1 }],
      real_time_monitoring: [{ timestamp: -1 }, { 'device.alertLevel': 1, timestamp: -1 }]
    };

    const recommendedIndexes = new Set();
    queryPatterns.forEach(pattern => {
      if (indexRecommendations[pattern]) {
        indexRecommendations[pattern].forEach(index => 
          recommendedIndexes.add(JSON.stringify(index))
        );
      }
    });

    return Array.from(recommendedIndexes).map(indexStr => JSON.parse(indexStr));
  }

  async createOptimalIndexes(collectionName, queryPatterns) {
    const collection = this.db.collection(collectionName);
    const recommendedIndexes = this.recommendIndexes(queryPatterns);

    for (const indexSpec of recommendedIndexes) {
      try {
        await collection.createIndex(indexSpec, { background: true });
        console.log(`Created index on ${collectionName}:`, indexSpec);
      } catch (error) {
        console.warn(`Index creation failed for ${collectionName}:`, error.message);
      }
    }
  }

  async implementRealTimeStreamProcessing(collectionName, processingRules) {
    console.log(`Implementing real-time stream processing for ${collectionName}`);

    const collection = this.db.collection(collectionName);

    // Create change stream for real-time processing
    const changeStream = collection.watch([], {
      fullDocument: 'updateLookup'
    });

    const processor = {
      db: this.db,  // capture the database handle so the rule handlers below can write alerts
      rules: processingRules,
      stats: {
        processed: 0,
        alerts: 0,
        errors: 0,
        startTime: new Date()
      },

      async processChange(change) {
        this.stats.processed++;

        try {
          if (change.operationType === 'insert') {
            const document = change.fullDocument;

            // Apply processing rules
            for (const rule of this.rules) {
              const result = await this.applyRule(rule, document);

              if (result.triggered) {
                await this.handleRuleTriggered(rule, document, result);
                this.stats.alerts++;
              }
            }
          }
        } catch (error) {
          console.error('Stream processing error:', error);
          this.stats.errors++;
        }
      },

      async applyRule(rule, document) {
        switch (rule.type) {
          case 'threshold':
            return {
              triggered: this.evaluateThreshold(document.value, rule.threshold, rule.operator),
              value: document.value,
              threshold: rule.threshold
            };

          case 'anomaly':
            return await this.detectAnomaly(document, rule.parameters);

          case 'trend':
            return await this.detectTrend(document, rule.parameters);

          default:
            return { triggered: false };
        }
      },

      evaluateThreshold(value, threshold, operator) {
        switch (operator) {
          case '>': return value > threshold;
          case '<': return value < threshold;
          case '>=': return value >= threshold;
          case '<=': return value <= threshold;
          case '==': return Math.abs(value - threshold) < 0.001;
          default: return false;
        }
      },

      async detectAnomaly(document, parameters) {
        // Simplified anomaly detection using recent historical data
        const recentData = await collection.find({
          'device.deviceId': document.device.deviceId,
          'device.sensorType': document.device.sensorType,
          timestamp: {
            $gte: new Date(Date.now() - parameters.windowMs),
            $lt: document.timestamp
          }
        }).limit(parameters.sampleSize).toArray();

        if (recentData.length < parameters.minSamples) {
          return { triggered: false, reason: 'insufficient_data' };
        }

        const values = recentData.map(d => d.value);
        const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
        const variance = values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length;
        const stdDev = Math.sqrt(variance);

        const zScore = Math.abs(document.value - mean) / stdDev;
        const isAnomalous = zScore > parameters.threshold;

        return {
          triggered: isAnomalous,
          zScore: zScore,
          mean: mean,
          stdDev: stdDev,
          value: document.value
        };
      },

      async detectTrend(document, parameters) {
        // Simplified trend detection using linear regression
        const trendData = await collection.find({
          'device.deviceId': document.device.deviceId,
          'device.sensorType': document.device.sensorType,
          timestamp: {
            $gte: new Date(Date.now() - parameters.windowMs)
          }
        }).sort({ timestamp: 1 }).toArray();

        if (trendData.length < parameters.minPoints) {
          return { triggered: false, reason: 'insufficient_data' };
        }

        // Calculate trend slope
        const n = trendData.length;
        const sumX = trendData.reduce((sum, d, i) => sum + i, 0);
        const sumY = trendData.reduce((sum, d) => sum + d.value, 0);
        const sumXY = trendData.reduce((sum, d, i) => sum + i * d.value, 0);
        const sumX2 = trendData.reduce((sum, d, i) => sum + i * i, 0);

        const slope = (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
        const isSignificant = Math.abs(slope) > parameters.slopeThreshold;

        return {
          triggered: isSignificant,
          slope: slope,
          direction: slope > 0 ? 'increasing' : 'decreasing',
          dataPoints: n
        };
      },

      async handleRuleTriggered(rule, document, result) {
        console.log(`Rule triggered: ${rule.name}`, {
          device: document.device.deviceId,
          sensor: document.device.sensorType,
          value: document.value,
          timestamp: document.timestamp,
          result: result
        });

        // Store alert
        await this.db.collection('alerts').insertOne({
          ruleName: rule.name,
          ruleType: rule.type,
          deviceId: document.device.deviceId,
          sensorType: document.device.sensorType,
          value: document.value,
          timestamp: document.timestamp,
          triggerResult: result,
          severity: rule.severity || 'medium',
          createdAt: new Date()
        });

        // Execute actions if configured
        if (rule.actions) {
          for (const action of rule.actions) {
            await this.executeAction(action, document, result);
          }
        }
      },

      async executeAction(action, document, result) {
        switch (action.type) {
          case 'webhook':
            // Simulate webhook call
            console.log(`Webhook action: ${action.url}`, { document, result });
            break;

          case 'email':
            console.log(`Email action: ${action.recipient}`, { document, result });
            break;

          case 'database':
            await this.db.collection(action.collection).insertOne({
              ...action.document,
              sourceDocument: document,
              triggerResult: result,
              createdAt: new Date()
            });
            break;
        }
      },

      getStats() {
        const runtime = Date.now() - this.stats.startTime.getTime();
        return {
          ...this.stats,
          runtimeMs: runtime,
          processingRate: this.stats.processed / (runtime / 1000),
          errorRate: this.stats.errors / this.stats.processed
        };
      }
    };

    // Set up change stream event handlers
    changeStream.on('change', async (change) => {
      await processor.processChange(change);
    });

    changeStream.on('error', (error) => {
      console.error('Change stream error:', error);
      processor.stats.errors++;
    });

    return {
      processor: processor,
      changeStream: changeStream,
      stop: () => changeStream.close()
    };
  }

  async performTimeSeriesBenchmark(collectionName, testConfig) {
    console.log(`Performing time-series benchmark on ${collectionName}`);

    const collection = this.db.collection(collectionName);
    const results = {
      ingestion: {},
      queries: {},
      aggregations: {}
    };

    // Benchmark high-frequency ingestion
    console.log('Benchmarking ingestion performance...');
    const ingestionStart = Date.now();
    const testData = this.generateBenchmarkData(testConfig.documentCount);

    const batchSize = testConfig.batchSize || 1000;
    let totalInserted = 0;

    for (let i = 0; i < testData.length; i += batchSize) {
      const batch = testData.slice(i, i + batchSize);

      try {
        const insertResult = await collection.insertMany(batch, { ordered: false });
        totalInserted += insertResult.insertedCount;
      } catch (error) {
        console.warn('Batch insertion error:', error.message);
        if (error.result && error.result.insertedCount) {
          totalInserted += error.result.insertedCount;
        }
      }
    }

    const ingestionTime = Date.now() - ingestionStart;
    results.ingestion = {
      documentsInserted: totalInserted,
      timeMs: ingestionTime,
      documentsPerSecond: Math.round(totalInserted / (ingestionTime / 1000)),
      avgBatchTime: Math.round(ingestionTime / Math.ceil(testData.length / batchSize))
    };

    // Benchmark time-range queries
    console.log('Benchmarking query performance...');
    const queryTests = [
      {
        name: 'recent_data',
        filter: { timestamp: { $gte: new Date(Date.now() - 3600000) } } // Last hour
      },
      {
        name: 'device_specific',
        filter: { 'device.deviceId': testData[0].device.deviceId }
      },
      {
        name: 'sensor_type_filter',
        filter: { 'device.sensorType': 'temperature' }
      },
      {
        name: 'complex_filter',
        filter: {
          'device.sensorType': 'temperature',
          value: { $gt: 20, $lt: 30 },
          timestamp: { $gte: new Date(Date.now() - 7200000) }
        }
      }
    ];

    results.queries = {};

    for (const queryTest of queryTests) {
      const queryStart = Date.now();
      const queryResults = await collection.find(queryTest.filter).limit(1000).toArray();
      const queryTime = Date.now() - queryStart;

      results.queries[queryTest.name] = {
        timeMs: queryTime,
        documentsReturned: queryResults.length,
        documentsPerSecond: Math.round(queryResults.length / (queryTime / 1000))
      };
    }

    // Benchmark aggregation performance
    console.log('Benchmarking aggregation performance...');
    const aggregationTests = [
      {
        name: 'hourly_averages',
        pipeline: [
          { $match: { timestamp: { $gte: new Date(Date.now() - 86400000) } } },
          {
            $group: {
              _id: {
                hour: { $dateToString: { format: '%Y-%m-%d-%H', date: '$timestamp' } },
                deviceId: '$device.deviceId',
                sensorType: '$device.sensorType'
              },
              avgValue: { $avg: '$value' },
              count: { $sum: 1 }
            }
          }
        ]
      },
      {
        name: 'device_statistics',
        pipeline: [
          { $match: { timestamp: { $gte: new Date(Date.now() - 86400000) } } },
          {
            $group: {
              _id: '$device.deviceId',
              sensors: { $addToSet: '$device.sensorType' },
              totalReadings: { $sum: 1 },
              avgValue: { $avg: '$value' },
              minValue: { $min: '$value' },
              maxValue: { $max: '$value' }
            }
          }
        ]
      },
      {
        name: 'time_series_bucketing',
        pipeline: [
          { $match: { timestamp: { $gte: new Date(Date.now() - 3600000) } } },
          {
            $bucket: {
              groupBy: '$value',
              boundaries: [0, 10, 20, 30, 40, 50, 100],
              default: 'other',
              output: {
                count: { $sum: 1 },
                avgTimestamp: { $avg: { $toLong: '$timestamp' } } // $avg ignores Date values, so convert to epoch millis first
              }
            }
          }
        ]
      }
    ];

    results.aggregations = {};

    for (const aggTest of aggregationTests) {
      const aggStart = Date.now();
      const aggResults = await collection.aggregate(aggTest.pipeline, { allowDiskUse: true }).toArray();
      const aggTime = Date.now() - aggStart;

      results.aggregations[aggTest.name] = {
        timeMs: aggTime,
        resultsReturned: aggResults.length
      };
    }

    return results;
  }

  generateBenchmarkData(count) {
    const deviceIds = Array.from({ length: 10 }, (_, i) => `device_${i.toString().padStart(3, '0')}`);
    const sensorTypes = ['temperature', 'humidity', 'pressure', 'vibration', 'light'];
    const baseTimestamp = Date.now() - (count * 1000); // Spread over time

    return Array.from({ length: count }, (_, i) => ({
      timestamp: new Date(baseTimestamp + i * 1000 + Math.random() * 1000),
      value: Math.random() * 100,
      device: {
        deviceId: deviceIds[Math.floor(Math.random() * deviceIds.length)],
        sensorType: sensorTypes[Math.floor(Math.random() * sensorTypes.length)],
        location: {
          type: 'Point',
          coordinates: [
            -74.0060 + (Math.random() - 0.5) * 0.1,
            40.7128 + (Math.random() - 0.5) * 0.1
          ]
        },
        batteryLevel: Math.random() * 100,
        signalStrength: Math.random() * 100
      }
    }));
  }
}

SQL-Style Time-Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB time-series collections and temporal operations:

-- QueryLeaf time-series operations with SQL-familiar syntax

-- Create time-series table with optimal configuration
CREATE TABLE sensor_readings (
  timestamp TIMESTAMP NOT NULL,
  value NUMERIC(15,6) NOT NULL,
  device_id VARCHAR(50) NOT NULL,
  sensor_type VARCHAR(50) NOT NULL,
  location GEOGRAPHY(POINT),
  quality_score INTEGER,
  metadata JSONB
) WITH (
  time_series = true,
  time_field = 'timestamp',
  meta_field = 'metadata',
  granularity = 'minutes',
  compression = 'zstd'
);

-- High-frequency sensor data insertion optimized for time-series
INSERT INTO sensor_readings (
  timestamp, value, device_id, sensor_type, location, quality_score, metadata
)
SELECT 
  NOW() - (generate_series * INTERVAL '1 second') as timestamp,
  RANDOM() * 100 as value,
  'device_' || LPAD((generate_series % 100)::text, 3, '0') as device_id,
  CASE (generate_series % 5)
    WHEN 0 THEN 'temperature'
    WHEN 1 THEN 'humidity'
    WHEN 2 THEN 'pressure'
    WHEN 3 THEN 'vibration'
    ELSE 'light'
  END as sensor_type,
  ST_Point(
    -74.0060 + (RANDOM() - 0.5) * 0.1,
    40.7128 + (RANDOM() - 0.5) * 0.1
  ) as location,
  (RANDOM() * 100)::integer as quality_score,
  JSON_BUILD_OBJECT(
    'firmware_version', '2.1.' || (generate_series % 10)::text,
    'battery_level', (RANDOM() * 100)::integer,
    'signal_strength', (RANDOM() * 100)::integer,
    'calibration_date', NOW() - (RANDOM() * 365 || ' days')::interval
  ) as metadata
FROM generate_series(1, 100000) as generate_series;

-- Time-series analytics with window functions and temporal aggregations
WITH time_buckets AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    -- MongoDB time-series optimized aggregations
    COUNT(*) as reading_count,
    AVG(value) as avg_value,
    MIN(value) as min_value,
    MAX(value) as max_value,
    STDDEV(value) as std_deviation,

    -- Percentile functions for distribution analysis
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value) as p99,

    -- Quality metrics (quality_score is a top-level column; the rest come from the JSONB metadata)
    AVG(quality_score) as avg_quality,
    AVG((metadata->>'battery_level')::numeric) as avg_battery,
    AVG((metadata->>'signal_strength')::numeric) as avg_signal,

    -- Time-series specific calculations
    COUNT(DISTINCT DATE_TRUNC('minute', timestamp)) as minutes_with_data,
    (COUNT(DISTINCT DATE_TRUNC('minute', timestamp)) / 60.0 * 100) as completeness_percent,

    -- Geospatial analytics
    ST_Centroid(ST_Collect(location)) as avg_location,
    ST_ConvexHull(ST_Collect(location)) as reading_area,

    -- Array aggregation for detailed analysis
    ARRAY_AGG(value ORDER BY timestamp) as value_sequence,
    ARRAY_AGG(timestamp ORDER BY timestamp) as timestamp_sequence

  FROM sensor_readings
  WHERE timestamp >= NOW() - INTERVAL '24 hours'
    AND quality_score > 70
  GROUP BY device_id, sensor_type, DATE_TRUNC('hour', timestamp)
),

trend_analysis AS (
  SELECT 
    tb.*,

    -- Time-series trend calculation using linear regression (avg_value regressed on the hour index)
    REGR_SLOPE(
      avg_value,
      (row_number() OVER (PARTITION BY device_id, sensor_type ORDER BY hour_bucket))::numeric
    ) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as trend_slope,

    -- Moving averages for smoothing
    AVG(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    ) as smoothed_avg,

    -- Volatility analysis
    STDDEV(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as volatility_6h,

    -- Change detection
    LAG(avg_value, 1) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket
    ) as prev_hour_avg,

    LAG(avg_value, 24) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket
    ) as same_hour_yesterday,

    -- Anomaly scoring based on historical patterns
    (avg_value - AVG(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 23 PRECEDING AND 1 PRECEDING
    )) / NULLIF(STDDEV(avg_value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY hour_bucket 
      ROWS BETWEEN 23 PRECEDING AND 1 PRECEDING
    ), 0) as z_score

  FROM time_buckets tb
),

device_health_analysis AS (
  SELECT 
    ta.device_id,
    ta.sensor_type,
    ta.hour_bucket,
    ta.reading_count,
    ta.avg_value,
    ta.median,
    ta.p95,
    ta.completeness_percent,

    -- Trend classification
    CASE 
      WHEN ta.trend_slope > 0.1 THEN 'increasing'
      WHEN ta.trend_slope < -0.1 THEN 'decreasing'
      ELSE 'stable'
    END as trend_direction,

    -- Change analysis
    ROUND((ta.avg_value - ta.prev_hour_avg)::numeric, 3) as hour_over_hour_change,
    ROUND(((ta.avg_value - ta.prev_hour_avg) / NULLIF(ta.prev_hour_avg, 0) * 100)::numeric, 2) as hour_over_hour_pct,

    ROUND((ta.avg_value - ta.same_hour_yesterday)::numeric, 3) as day_over_day_change,
    ROUND(((ta.avg_value - ta.same_hour_yesterday) / NULLIF(ta.same_hour_yesterday, 0) * 100)::numeric, 2) as day_over_day_pct,

    -- Anomaly detection
    ROUND(ta.z_score::numeric, 3) as anomaly_score,
    CASE 
      WHEN ABS(ta.z_score) > 3 THEN 'critical'
      WHEN ABS(ta.z_score) > 2 THEN 'warning'
      ELSE 'normal'
    END as anomaly_level,

    -- Performance scoring
    CASE 
      WHEN ta.completeness_percent > 95 AND ta.avg_quality > 90 THEN 'excellent'
      WHEN ta.completeness_percent > 85 AND ta.avg_quality > 80 THEN 'good'
      WHEN ta.completeness_percent > 70 AND ta.avg_quality > 70 THEN 'acceptable'
      ELSE 'poor'
    END as data_quality,

    -- Operational health
    ROUND(ta.avg_battery::numeric, 1) as avg_battery_level,
    ROUND(ta.avg_signal::numeric, 1) as avg_signal_strength,

    CASE 
      WHEN ta.avg_battery > 80 AND ta.avg_signal > 80 THEN 'healthy'
      WHEN ta.avg_battery > 50 AND ta.avg_signal > 60 THEN 'degraded'
      ELSE 'critical'
    END as operational_status,

    -- Geographic analysis
    ST_X(ta.avg_location) as avg_longitude,
    ST_Y(ta.avg_location) as avg_latitude,
    ST_Area(ta.reading_area::geography) / 1000000 as coverage_area_km2

  FROM trend_analysis ta
),

alert_generation AS (
  SELECT 
    dha.*,

    -- Generate alerts based on multiple criteria
    CASE 
      WHEN dha.anomaly_level = 'critical' AND dha.operational_status = 'critical' THEN 'CRITICAL'
      WHEN dha.anomaly_level IN ('critical', 'warning') OR dha.operational_status = 'critical' THEN 'HIGH' 
      WHEN dha.data_quality = 'poor' OR dha.operational_status = 'degraded' THEN 'MEDIUM'
      WHEN ABS(dha.day_over_day_pct) > 50 THEN 'MEDIUM'
      ELSE 'LOW'
    END as alert_priority,

    -- Alert message generation
    CONCAT_WS('; ',
      CASE WHEN dha.anomaly_level = 'critical' THEN 'Anomaly detected (z-score: ' || dha.anomaly_score || ')' END,
      CASE WHEN dha.operational_status = 'critical' THEN 'Operational issues (battery: ' || dha.avg_battery_level || '%, signal: ' || dha.avg_signal_strength || '%)' END,
      CASE WHEN dha.data_quality = 'poor' THEN 'Poor data quality (' || dha.completeness_percent || '% completeness)' END,
      CASE WHEN ABS(dha.day_over_day_pct) > 50 THEN 'Significant day-over-day change: ' || dha.day_over_day_pct || '%' END
    ) as alert_message,

    -- Recommended actions
    ARRAY_REMOVE(ARRAY[
      CASE WHEN dha.avg_battery_level < 20 THEN 'Replace battery' END,
      CASE WHEN dha.avg_signal_strength < 30 THEN 'Check network connectivity' END,
      CASE WHEN dha.completeness_percent < 70 THEN 'Investigate data transmission issues' END,
      CASE WHEN ABS(dha.anomaly_score) > 3 THEN 'Verify sensor calibration' END,
      CASE WHEN dha.trend_direction != 'stable' THEN 'Monitor trend continuation' END
    ], NULL) as recommended_actions

  FROM device_health_analysis dha
)

SELECT 
  device_id,
  sensor_type,
  hour_bucket,
  avg_value,
  trend_direction,
  anomaly_level,
  data_quality,
  operational_status,
  alert_priority,
  alert_message,
  recommended_actions,

  -- Additional context for investigation
  JSON_BUILD_OBJECT(
    'statistics', JSON_BUILD_OBJECT(
      'median', median,
      'p95', p95,
      'completeness', completeness_percent
    ),
    'changes', JSON_BUILD_OBJECT(
      'hour_over_hour', hour_over_hour_pct,
      'day_over_day', day_over_day_pct
    ),
    'operational', JSON_BUILD_OBJECT(
      'battery_level', avg_battery_level,
      'signal_strength', avg_signal_strength
    ),
    'location', JSON_BUILD_OBJECT(
      'longitude', avg_longitude,
      'latitude', avg_latitude,
      'coverage_area_km2', coverage_area_km2
    )
  ) as analysis_context

FROM alert_generation
WHERE alert_priority IN ('CRITICAL', 'HIGH', 'MEDIUM')
ORDER BY 
  CASE alert_priority
    WHEN 'CRITICAL' THEN 1
    WHEN 'HIGH' THEN 2
    WHEN 'MEDIUM' THEN 3
    ELSE 4
  END,
  device_id, sensor_type, hour_bucket DESC;

-- Real-time streaming analytics with time windows
WITH real_time_metrics AS (
  SELECT 
    device_id,
    sensor_type,

    -- 5-minute rolling window aggregations
    AVG(value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp 
      RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW
    ) as avg_5m,

    COUNT(*) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp 
      RANGE BETWEEN INTERVAL '5 minutes' PRECEDING AND CURRENT ROW
    ) as count_5m,

    -- 1-hour rolling window for trend detection
    AVG(value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp 
      RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
    ) as avg_1h,

    STDDEV(value) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp 
      RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
    ) as stddev_1h,

    -- Rate of change detection
    (value - LAG(value, 10) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp
    )) / NULLIF(EXTRACT(EPOCH FROM (timestamp - LAG(timestamp, 10) OVER (
      PARTITION BY device_id, sensor_type 
      ORDER BY timestamp
    ))), 0) as rate_of_change,

    -- Current values for comparison
    timestamp,
    value,
    quality_score,
    (metadata->>'battery_level')::numeric as battery_level

  FROM sensor_readings
  WHERE timestamp >= NOW() - INTERVAL '2 hours'
),

real_time_alerts AS (
  SELECT 
    *,

    -- Real-time anomaly detection
    CASE 
      WHEN ABS(value - avg_1h) > 3 * NULLIF(stddev_1h, 0) THEN 'ANOMALY'
      WHEN ABS(rate_of_change) > 10 THEN 'RAPID_CHANGE'  
      WHEN count_5m < 5 AND EXTRACT(EPOCH FROM (NOW() - timestamp)) < 300 THEN 'DATA_GAP'
      WHEN battery_level < 15 THEN 'LOW_BATTERY'
      WHEN quality_score < 60 THEN 'POOR_QUALITY'
      ELSE 'NORMAL'
    END as real_time_alert,

    -- Severity assessment
    CASE 
      WHEN ABS(value - avg_1h) > 5 * NULLIF(stddev_1h, 0) OR ABS(rate_of_change) > 50 THEN 'CRITICAL'
      WHEN ABS(value - avg_1h) > 3 * NULLIF(stddev_1h, 0) OR ABS(rate_of_change) > 20 THEN 'HIGH'
      WHEN battery_level < 15 OR quality_score < 40 THEN 'MEDIUM'
      ELSE 'LOW'
    END as alert_severity

  FROM real_time_metrics
  WHERE timestamp >= NOW() - INTERVAL '15 minutes'
)

SELECT 
  device_id,
  sensor_type,
  timestamp,
  value,
  real_time_alert,
  alert_severity,

  -- Context for immediate action
  ROUND(avg_5m::numeric, 3) as five_min_avg,
  ROUND(avg_1h::numeric, 3) as one_hour_avg,
  ROUND(rate_of_change::numeric, 3) as change_rate,
  count_5m as readings_last_5min,
  battery_level,
  quality_score,

  -- Time since alert
  EXTRACT(EPOCH FROM (NOW() - timestamp))::integer as seconds_ago

FROM real_time_alerts
WHERE real_time_alert != 'NORMAL' 
  AND alert_severity IN ('CRITICAL', 'HIGH', 'MEDIUM')
ORDER BY alert_severity DESC, timestamp DESC
LIMIT 100;

-- Time-series data retention and archival management
WITH retention_analysis AS (
  SELECT 
    device_id,
    sensor_type,
    DATE_TRUNC('day', timestamp) as day_bucket,
    COUNT(*) as daily_readings,
    MIN(timestamp) as first_reading,
    MAX(timestamp) as last_reading,
    AVG(quality_score) as avg_daily_quality,

    -- Age-based classification
    CASE 
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_DATE - INTERVAL '30 days' THEN 'recent'
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_DATE - INTERVAL '90 days' THEN 'standard'
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_DATE - INTERVAL '365 days' THEN 'historical'
      ELSE 'archive'
    END as data_tier,

    -- Storage cost analysis (assumes roughly 1 KB per reading)
    COUNT(*) * 0.001 as estimated_storage_mb,
    EXTRACT(DAYS FROM (CURRENT_DATE - DATE_TRUNC('day', timestamp))) as days_old

  FROM sensor_readings
  GROUP BY device_id, sensor_type, DATE_TRUNC('day', timestamp)
)

SELECT 
  data_tier,
  COUNT(DISTINCT device_id) as unique_devices,
  COUNT(DISTINCT sensor_type) as sensor_types,
  SUM(daily_readings) as total_readings,
  ROUND(SUM(estimated_storage_mb)::numeric, 2) as total_storage_mb,
  ROUND(AVG(avg_daily_quality)::numeric, 1) as avg_quality_score,
  MIN(days_old) as newest_data_days,
  MAX(days_old) as oldest_data_days,

  -- Archival recommendations
  CASE 
    WHEN data_tier = 'archive' THEN 'Move to cold storage or delete low-quality data'
    WHEN data_tier = 'historical' THEN 'Consider compression or aggregation to daily summaries'
    WHEN data_tier = 'standard' THEN 'Maintain current storage with periodic cleanup'
    ELSE 'Keep in high-performance storage'
  END as storage_recommendation

FROM retention_analysis
GROUP BY data_tier
ORDER BY 
  CASE data_tier
    WHEN 'recent' THEN 1
    WHEN 'standard' THEN 2
    WHEN 'historical' THEN 3
    WHEN 'archive' THEN 4
  END;

-- QueryLeaf provides comprehensive time-series capabilities:
-- 1. Optimized time-series collection creation with automatic bucketing
-- 2. High-performance ingestion for streaming sensor and IoT data
-- 3. Advanced temporal aggregations with window functions and trend analysis
-- 4. Real-time anomaly detection and alerting systems
-- 5. Geospatial analytics integration for location-aware time-series data
-- 6. Comprehensive data quality monitoring and operational health tracking
-- 7. Intelligent data retention and archival management strategies
-- 8. SQL-familiar syntax for complex time-series analytics and reporting
-- 9. Integration with MongoDB's native time-series optimizations
-- 10. Familiar SQL patterns for temporal data analysis and visualization

Best Practices for Time-Series Implementation

Collection Design Strategy

Essential principles for optimal MongoDB time-series collection design, with a creation sketch following the list:

  1. Granularity Selection: Choose appropriate granularity based on data frequency and query patterns
  2. Metadata Organization: Structure metadata fields to enable efficient grouping and filtering
  3. Index Strategy: Create indexes that support temporal range queries and metadata filtering
  4. Compression Configuration: Select compression algorithms based on data characteristics
  5. Bucketing Optimization: Monitor bucket sizes and adjust granularity for optimal performance
  6. Storage Planning: Plan for data growth and implement retention policies
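
As a concrete illustration of these principles, the following Node.js sketch creates a time-series collection with an explicit granularity, a metaField for device metadata, a retention policy, and a supporting compound index. The database and collection names (iot_platform, sensor_readings) and the 90-day retention window are illustrative assumptions rather than values taken from the examples above; compression tuning is omitted for brevity.

// Minimal sketch: create a time-series collection following the principles above
const { MongoClient } = require('mongodb');

async function createSensorReadingsCollection(uri) {
  const client = new MongoClient(uri);
  await client.connect();
  const db = client.db('iot_platform');

  await db.createCollection('sensor_readings', {
    timeseries: {
      timeField: 'timestamp',    // BSON Date field used for bucketing
      metaField: 'device',       // groups readings from the same device into buckets
      granularity: 'minutes'     // matches the expected per-minute ingest frequency
    },
    expireAfterSeconds: 90 * 24 * 60 * 60  // retention policy: remove raw readings after ~90 days
  });

  // Compound secondary index supporting metadata filtering plus time-range queries
  await db.collection('sensor_readings').createIndex({ 'device.deviceId': 1, timestamp: -1 });

  await client.close();
}

Granularity generally cannot be made finer after creation, so it is worth benchmarking with representative data volumes before settling on it.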

Performance and Scalability

Optimize MongoDB time-series collections for production workloads; an ingestion and aggregation sketch follows the list:

  1. Ingestion Optimization: Use batch insertions and optimal write concerns for high throughput
  2. Query Performance: Design aggregation pipelines that leverage time-series optimizations
  3. Real-time Analytics: Implement change streams for real-time processing and alerting
  4. Resource Management: Monitor memory usage and enable disk spilling for large aggregations
  5. Sharding Strategy: Plan horizontal scaling for very high-volume time-series data
  6. Monitoring Setup: Track collection performance, compression ratios, and query patterns
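
The sketch below illustrates items 1 and 4 from this list under the same illustrative assumptions as before (the iot_platform database and sensor_readings collection are placeholders): unordered batch inserts with a relaxed write concern for throughput, and an hourly roll-up aggregation run with allowDiskUse so a large $group stage can spill to disk.

// Minimal sketch: high-throughput ingestion plus a disk-spilling aggregation
const { MongoClient } = require('mongodb');

async function ingestAndSummarize(uri, readings) {
  const client = new MongoClient(uri);
  await client.connect();
  const collection = client.db('iot_platform').collection('sensor_readings');

  // Unordered bulk inserts keep throughput high even if individual documents are rejected
  const BATCH_SIZE = 1000;
  for (let i = 0; i < readings.length; i += BATCH_SIZE) {
    await collection.insertMany(readings.slice(i, i + BATCH_SIZE), {
      ordered: false,
      writeConcern: { w: 1 }
    });
  }

  // allowDiskUse lets the $group stage spill to disk instead of failing on the memory limit
  const hourlyAverages = await collection.aggregate([
    { $match: { timestamp: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) } } },
    {
      $group: {
        _id: {
          deviceId: '$device.deviceId',
          hour: { $dateTrunc: { date: '$timestamp', unit: 'hour' } }
        },
        avgValue: { $avg: '$value' },
        readingCount: { $sum: 1 }
      }
    },
    { $sort: { '_id.hour': 1 } }
  ], { allowDiskUse: true }).toArray();

  await client.close();
  return hourlyAverages;
}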

Conclusion

MongoDB Time-Series Collections provide specialized optimization for temporal data that eliminates the performance and storage inefficiencies of traditional time-series approaches. The combination of automatic bucketing, intelligent compression, and time-aware indexing makes handling high-volume IoT and sensor data both efficient and scalable.

Key MongoDB Time-Series benefits include:

  • Automatic Optimization: Built-in bucketing and compression optimized for temporal data patterns
  • Storage Efficiency: Up to 90% storage reduction compared to regular document collections
  • Query Performance: Time-aware indexing and aggregation pipeline optimization
  • High-Throughput Ingestion: Optimized write patterns for streaming sensor data
  • Real-Time Analytics: Integration with change streams for real-time processing
  • Flexible Metadata: Support for complex device and sensor metadata structures

Whether you're building IoT platforms, sensor networks, financial trading systems, or real-time analytics applications, MongoDB Time-Series Collections with QueryLeaf's familiar SQL interface provide the foundation for high-performance temporal data management.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB time-series operations while providing SQL-familiar temporal analytics, window functions, and time-based aggregations. Advanced time-series patterns, real-time alerting, and performance monitoring are seamlessly handled through familiar SQL constructs, making sophisticated temporal analytics both powerful and accessible to SQL-oriented development teams.

The integration of specialized time-series capabilities with SQL-style operations makes MongoDB an ideal platform for applications that require both high-performance temporal data management and familiar database interaction patterns. It ensures your time-series solutions remain performant and maintainable as they scale to massive data volumes and real-time processing requirements.