Blog

September 20, 2025
18 min read

MongoDB Change Streams and Event-Driven Architecture: Building Reactive Applications with SQL-Style Event Processing

Modern applications increasingly require real-time responsiveness and event-driven architectures that can react instantly to data changes across distributed systems. Traditional polling-based approaches for change detection introduce significant latency, resource overhead, and scaling challenges that make building responsive applications complex and inefficient.

MongoDB Change Streams provide native event streaming capabilities that enable applications to watch for data changes in real-time, triggering immediate reactions without polling overhead. Unlike traditional database triggers or external change data capture systems, MongoDB Change Streams offer a unified, scalable approach to event-driven architecture that works seamlessly across replica sets and sharded clusters.

The Traditional Change Detection Challenge

Traditional approaches to detecting and reacting to data changes have significant architectural and performance limitations:

-- Traditional polling approach - inefficient and high-latency

-- PostgreSQL polling-based change detection
CREATE TABLE user_activities (
    activity_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    activity_type VARCHAR(50) NOT NULL,
    activity_data JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed BOOLEAN DEFAULT FALSE
);

-- Polling query runs every few seconds
SELECT 
    activity_id,
    user_id,
    activity_type,
    activity_data,
    created_at
FROM user_activities 
WHERE processed = FALSE 
ORDER BY created_at ASC 
LIMIT 100;

-- Mark as processed after handling
UPDATE user_activities 
SET processed = TRUE, updated_at = CURRENT_TIMESTAMP
WHERE activity_id IN (1, 2, 3, ...);

-- Problems with polling approach:
-- 1. High latency - changes only detected on poll intervals
-- 2. Resource waste - constant querying even when no changes
-- 3. Scaling issues - increased polling frequency impacts performance
-- 4. Race conditions - multiple consumers competing for same records
-- 5. Complex state management - tracking processed vs unprocessed
-- 6. Poor real-time experience - delays in reaction to changes

-- Database trigger approach (limited and complex)
CREATE OR REPLACE FUNCTION notify_activity_change()
RETURNS TRIGGER AS $$
BEGIN
    PERFORM pg_notify('activity_changes', 
        json_build_object(
            'activity_id', NEW.activity_id,
            'user_id', NEW.user_id,
            'activity_type', NEW.activity_type,
            'operation', TG_OP
        )::text
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER activity_change_trigger
AFTER INSERT OR UPDATE OR DELETE ON user_activities
FOR EACH ROW EXECUTE FUNCTION notify_activity_change();

-- Trigger limitations:
-- - Limited to single database instance
-- - No ordering guarantees across tables
-- - Difficult error handling and retry logic
-- - Complex setup for distributed systems
-- - No built-in filtering or transformation
-- - Poor integration with modern event architectures

-- MySQL limitations (even more restrictive)
CREATE TABLE change_log (
    id INT AUTO_INCREMENT PRIMARY KEY,
    table_name VARCHAR(100),
    record_id VARCHAR(100), 
    operation VARCHAR(10),
    change_data JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Basic trigger for change tracking
DELIMITER $$
CREATE TRIGGER user_change_tracker
AFTER INSERT ON users
FOR EACH ROW
BEGIN
    INSERT INTO change_log (table_name, record_id, operation, change_data)
    VALUES ('users', NEW.id, 'INSERT', JSON_OBJECT('user_id', NEW.id));
END$$
DELIMITER ;

-- MySQL trigger limitations:
-- - Very limited JSON functionality
-- - No advanced event routing capabilities
-- - Poor performance with high-volume changes
-- - Complex maintenance and debugging
-- - No distributed system support

MongoDB Change Streams provide comprehensive event-driven capabilities:

// MongoDB Change Streams - native event-driven architecture
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('event_driven_platform');

// Advanced Change Stream implementation for event-driven architecture
class EventDrivenMongoDBPlatform {
  constructor(db) {
    this.db = db;
    this.changeStreams = new Map();
    this.eventHandlers = new Map();
    this.metrics = {
      eventsProcessed: 0,
      lastEvent: null,
      errorCount: 0
    };
  }

  async setupEventDrivenCollections() {
    // Create collections for different event types
    const collections = {
      userActivities: db.collection('user_activities'),
      orderEvents: db.collection('order_events'),
      inventoryChanges: db.collection('inventory_changes'),
      systemEvents: db.collection('system_events'),
      auditLog: db.collection('audit_log')
    };

    // Create indexes for optimal change stream performance
    for (const [name, collection] of Object.entries(collections)) {
      await collection.createIndex({ userId: 1, timestamp: -1 });
      await collection.createIndex({ eventType: 1, status: 1 });
      await collection.createIndex({ createdAt: -1 });
    }

    return collections;
  }

  async startChangeStreamWatchers() {
    console.log('Starting change stream watchers...');

    // 1. Watch all changes across entire database
    await this.watchDatabaseChanges();

    // 2. Watch specific collection changes with filtering
    await this.watchUserActivityChanges();

    // 3. Watch order processing pipeline
    await this.watchOrderEvents();

    // 4. Watch inventory for real-time stock updates
    await this.watchInventoryChanges();

    console.log('All change stream watchers started');
  }

  async watchDatabaseChanges() {
    console.log('Setting up database-level change stream...');

    const changeStream = this.db.watch(
      [
        // Pipeline to filter and transform events
        {
          $match: {
            // Only watch insert, update, delete operations
            operationType: { $in: ['insert', 'update', 'delete', 'replace'] },

            // Exclude system collections and temporary data
            'ns.coll': { 
              $not: { $regex: '^(system\.|temp_)' }
            }
          }
        },
        {
          $addFields: {
            // Add event metadata
            eventId: { $toString: '$_id' },
            eventTimestamp: '$clusterTime',
            database: '$ns.db',
            collection: '$ns.coll',

            // Create standardized event structure
            eventData: {
              $switch: {
                branches: [
                  {
                    case: { $eq: ['$operationType', 'insert'] },
                    then: {
                      operation: 'created',
                      document: '$fullDocument'
                    }
                  },
                  {
                    case: { $eq: ['$operationType', 'update'] },
                    then: {
                      operation: 'updated', 
                      documentKey: '$documentKey',
                      updatedFields: '$updateDescription.updatedFields',
                      removedFields: '$updateDescription.removedFields'
                    }
                  },
                  {
                    case: { $eq: ['$operationType', 'delete'] },
                    then: {
                      operation: 'deleted',
                      documentKey: '$documentKey'
                    }
                  }
                ],
                default: {
                  operation: '$operationType',
                  documentKey: '$documentKey'
                }
              }
            }
          }
        }
      ],
      {
        fullDocument: 'updateLookup', // Include full document for updates
        fullDocumentBeforeChange: 'whenAvailable' // Include before state
      }
    );

    this.changeStreams.set('database', changeStream);

    // Handle database-level events
    changeStream.on('change', async (changeEvent) => {
      try {
        await this.handleDatabaseEvent(changeEvent);
        this.updateMetrics('database', changeEvent);
      } catch (error) {
        console.error('Error handling database event:', error);
        this.metrics.errorCount++;
      }
    });

    changeStream.on('error', (error) => {
      console.error('Database change stream error:', error);
      this.handleChangeStreamError('database', error);
    });
  }

  async watchUserActivityChanges() {
    console.log('Setting up user activity change stream...');

    const userActivities = this.db.collection('user_activities');

    const changeStream = userActivities.watch(
      [
        {
          $match: {
            operationType: { $in: ['insert', 'update'] },

            // Only watch for significant user activities
            $or: [
              { 'fullDocument.activityType': 'login' },
              { 'fullDocument.activityType': 'purchase' },
              { 'fullDocument.activityType': 'subscription_change' },
              { 'fullDocument.status': 'completed' },
              { 'updateDescription.updatedFields.status': 'completed' }
            ]
          }
        }
      ],
      {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      }
    );

    this.changeStreams.set('userActivities', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        await this.handleUserActivityEvent(changeEvent);

        // Trigger downstream events based on activity type
        await this.triggerDownstreamEvents('user_activity', changeEvent);

      } catch (error) {
        console.error('Error handling user activity event:', error);
        await this.logEventError('user_activities', changeEvent, error);
      }
    });
  }

  async watchOrderEvents() {
    console.log('Setting up order events change stream...');

    const orderEvents = this.db.collection('order_events');

    const changeStream = orderEvents.watch(
      [
        {
          $match: {
            operationType: 'insert',

            // Order lifecycle events
            'fullDocument.eventType': {
              $in: ['order_created', 'payment_processed', 'order_shipped', 
                   'order_delivered', 'order_cancelled', 'refund_processed']
            }
          }
        },
        {
          $addFields: {
            // Enrich with order context
            orderStage: {
              $switch: {
                branches: [
                  { case: { $eq: ['$fullDocument.eventType', 'order_created'] }, then: 'pending' },
                  { case: { $eq: ['$fullDocument.eventType', 'payment_processed'] }, then: 'confirmed' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_shipped'] }, then: 'in_transit' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_delivered'] }, then: 'completed' },
                  { case: { $eq: ['$fullDocument.eventType', 'order_cancelled'] }, then: 'cancelled' }
                ],
                default: 'unknown'
              }
            },

            // Priority for event processing
            processingPriority: {
              $switch: {
                branches: [
                  { case: { $eq: ['$fullDocument.eventType', 'payment_processed'] }, then: 1 },
                  { case: { $eq: ['$fullDocument.eventType', 'order_created'] }, then: 2 },
                  { case: { $eq: ['$fullDocument.eventType', 'order_cancelled'] }, then: 1 },
                  { case: { $eq: ['$fullDocument.eventType', 'refund_processed'] }, then: 1 }
                ],
                default: 3
              }
            }
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    this.changeStreams.set('orderEvents', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        // Route to appropriate order processing handler
        await this.processOrderEventChange(changeEvent);

        // Update order state machine
        await this.updateOrderStateMachine(changeEvent);

        // Trigger business logic workflows
        await this.triggerOrderWorkflows(changeEvent);

      } catch (error) {
        console.error('Error processing order event:', error);
        await this.handleOrderEventError(changeEvent, error);
      }
    });
  }

  async watchInventoryChanges() {
    console.log('Setting up inventory change stream...');

    const inventoryChanges = this.db.collection('inventory_changes');

    const changeStream = inventoryChanges.watch(
      [
        {
          $match: {
            $or: [
              // Stock level changes
              { 
                operationType: 'update',
                'updateDescription.updatedFields.stockLevel': { $exists: true }
              },
              // New inventory items
              {
                operationType: 'insert',
                'fullDocument.itemType': 'product'
              },
              // Inventory alerts
              {
                operationType: 'insert',
                'fullDocument.alertType': { $in: ['low_stock', 'out_of_stock', 'restock'] }
              }
            ]
          }
        }
      ],
      {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      }
    );

    this.changeStreams.set('inventoryChanges', changeStream);

    changeStream.on('change', async (changeEvent) => {
      try {
        // Real-time inventory updates
        await this.handleInventoryChange(changeEvent);

        // Check for low stock alerts
        await this.checkInventoryAlerts(changeEvent);

        // Update product availability in real-time
        await this.updateProductAvailability(changeEvent);

        // Notify relevant systems (pricing, recommendations, etc.)
        await this.notifyInventorySubscribers(changeEvent);

      } catch (error) {
        console.error('Error handling inventory change:', error);
        await this.logInventoryError(changeEvent, error);
      }
    });
  }

  async handleDatabaseEvent(changeEvent) {
    const { database, collection, eventData, operationType } = changeEvent;

    console.log(`Database Event: ${operationType} in ${database}.${collection}`);

    // Global event logging
    await this.logGlobalEvent({
      eventId: changeEvent.eventId,
      timestamp: new Date(changeEvent.clusterTime),
      database: database,
      collection: collection,
      operation: operationType,
      eventData: eventData
    });

    // Route to collection-specific handlers
    await this.routeCollectionEvent(collection, changeEvent);

    // Update global metrics and monitoring
    await this.updateGlobalMetrics(changeEvent);
  }

  async handleUserActivityEvent(changeEvent) {
    const { fullDocument, operationType } = changeEvent;
    const activity = fullDocument;

    console.log(`User Activity: ${activity.activityType} for user ${activity.userId}`);

    // Real-time user analytics
    if (activity.activityType === 'login') {
      await this.updateUserSession(activity);
      await this.trackUserLocation(activity);
    }

    // Purchase events
    if (activity.activityType === 'purchase') {
      await this.processRealtimePurchase(activity);
      await this.updateRecommendations(activity.userId);
      await this.triggerLoyaltyUpdates(activity);
    }

    // Subscription changes
    if (activity.activityType === 'subscription_change') {
      await this.processSubscriptionChange(activity);
      await this.updateBilling(activity);
    }

    // Create reactive events for downstream systems
    await this.publishUserEvent(activity, operationType);
  }

  async processOrderEventChange(changeEvent) {
    const { fullDocument: orderEvent } = changeEvent;

    console.log(`Order Event: ${orderEvent.eventType} for order ${orderEvent.orderId}`);

    switch (orderEvent.eventType) {
      case 'order_created':
        await this.processNewOrder(orderEvent);
        break;

      case 'payment_processed':
        await this.confirmOrderPayment(orderEvent);
        await this.triggerFulfillment(orderEvent);
        break;

      case 'order_shipped':
        await this.updateShippingTracking(orderEvent);
        await this.notifyCustomer(orderEvent);
        break;

      case 'order_delivered':
        await this.completeOrder(orderEvent);
        await this.triggerPostDeliveryWorkflow(orderEvent);
        break;

      case 'order_cancelled':
        await this.processCancellation(orderEvent);
        await this.handleRefund(orderEvent);
        break;
    }

    // Update order analytics in real-time
    await this.updateOrderAnalytics(orderEvent);
  }

  async handleInventoryChange(changeEvent) {
    const { fullDocument: inventory, operationType } = changeEvent;

    console.log(`Inventory Change: ${operationType} for item ${inventory.itemId}`);

    // Real-time stock updates
    if (changeEvent.updateDescription?.updatedFields?.stockLevel !== undefined) {
      const newStock = changeEvent.fullDocument.stockLevel;
      const previousStock = changeEvent.fullDocumentBeforeChange?.stockLevel || 0;

      await this.handleStockLevelChange({
        itemId: inventory.itemId,
        previousStock: previousStock,
        newStock: newStock,
        changeAmount: newStock - previousStock
      });
    }

    // Product availability updates
    await this.updateProductCatalog(inventory);

    // Pricing adjustments based on stock levels
    await this.updateDynamicPricing(inventory);
  }

  async triggerDownstreamEvents(eventType, changeEvent) {
    // Message queue integration for external systems
    const event = {
      eventId: generateEventId(),
      eventType: eventType,
      timestamp: new Date(),
      source: 'mongodb-change-stream',
      data: changeEvent,
      version: '1.0'
    };

    // Publish to different channels based on event type
    await this.publishToEventBus(event);
    await this.updateEventSourcing(event);
    await this.triggerWebhooks(event);
  }

  async publishToEventBus(event) {
    // Integration with message queues (Kafka, RabbitMQ, etc.)
    console.log(`Publishing event ${event.eventId} to event bus`);

    // Route to appropriate topics/queues
    const routingKey = `${event.eventType}.${event.data.operationType}`;

    // Simulate message queue publishing
    // await messageQueue.publish(routingKey, event);
  }

  async setupResumeTokenPersistence() {
    // Persist resume tokens for fault tolerance
    const resumeTokens = this.db.collection('change_stream_resume_tokens');

    // Save resume tokens periodically
    setInterval(async () => {
      for (const [streamName, changeStream] of this.changeStreams.entries()) {
        try {
          const resumeToken = changeStream.resumeToken;
          if (resumeToken) {
            await resumeTokens.updateOne(
              { streamName: streamName },
              {
                $set: {
                  resumeToken: resumeToken,
                  lastUpdated: new Date()
                }
              },
              { upsert: true }
            );
          }
        } catch (error) {
          console.error(`Error saving resume token for ${streamName}:`, error);
        }
      }
    }, 10000); // Every 10 seconds
  }

  async handleChangeStreamError(streamName, error) {
    console.error(`Change stream ${streamName} encountered error:`, error);

    // Implement retry logic with exponential backoff
    setTimeout(async () => {
      try {
        console.log(`Attempting to restart change stream: ${streamName}`);

        // Load last known resume token
        const resumeTokenDoc = await this.db.collection('change_stream_resume_tokens')
          .findOne({ streamName: streamName });

        // Restart stream from last known position
        if (resumeTokenDoc?.resumeToken) {
          // Restart with resume token
          await this.restartChangeStream(streamName, resumeTokenDoc.resumeToken);
        } else {
          // Restart from current time
          await this.restartChangeStream(streamName);
        }

      } catch (retryError) {
        console.error(`Failed to restart change stream ${streamName}:`, retryError);
        // Implement exponential backoff retry
      }
    }, 5000); // Initial 5-second delay
  }

  async getChangeStreamMetrics() {
    return {
      activeStreams: this.changeStreams.size,
      eventsProcessed: this.metrics.eventsProcessed,
      lastEventTime: this.metrics.lastEvent,
      errorCount: this.metrics.errorCount,

      streamHealth: Array.from(this.changeStreams.entries()).map(([name, stream]) => ({
        name: name,
        isActive: !stream.closed,
        hasResumeToken: !!stream.resumeToken
      }))
    };
  }

  updateMetrics(streamName, changeEvent) {
    this.metrics.eventsProcessed++;
    this.metrics.lastEvent = new Date();

    console.log(`Processed event from ${streamName}: ${changeEvent.operationType}`);
  }

  async shutdown() {
    console.log('Shutting down change streams...');

    // Close all change streams gracefully
    for (const [name, changeStream] of this.changeStreams.entries()) {
      try {
        await changeStream.close();
        console.log(`Closed change stream: ${name}`);
      } catch (error) {
        console.error(`Error closing change stream ${name}:`, error);
      }
    }

    this.changeStreams.clear();
    console.log('All change streams closed');
  }
}

// Usage example
const startEventDrivenPlatform = async () => {
  try {
    const platform = new EventDrivenMongoDBPlatform(db);

    // Setup collections and indexes
    await platform.setupEventDrivenCollections();

    // Start change stream watchers
    await platform.startChangeStreamWatchers();

    // Setup fault tolerance
    await platform.setupResumeTokenPersistence();

    // Monitor platform health
    setInterval(async () => {
      const metrics = await platform.getChangeStreamMetrics();
      console.log('Platform Metrics:', metrics);
    }, 30000); // Every 30 seconds

    console.log('Event-driven platform started successfully');
    return platform;

  } catch (error) {
    console.error('Error starting event-driven platform:', error);
    throw error;
  }
};

// Benefits of MongoDB Change Streams:
// - Real-time event processing without polling overhead
// - Ordered, durable event streams with resume token support  
// - Cluster-wide change detection across replica sets and shards
// - Rich filtering and transformation capabilities through aggregation pipelines
// - Built-in fault tolerance and automatic failover
// - Integration with MongoDB's ACID transactions
// - Scalable event-driven architecture foundation
// - Native integration with MongoDB ecosystem and tools

module.exports = {
  EventDrivenMongoDBPlatform,
  startEventDrivenPlatform
};

Understanding MongoDB Change Streams Architecture

Advanced Change Stream Patterns

Implement sophisticated change stream patterns for different event-driven scenarios:

// Advanced change stream patterns and event processing
class AdvancedChangeStreamPatterns {
  constructor(db) {
    this.db = db;
    this.eventProcessors = new Map();
    this.eventStore = db.collection('event_store');
    this.eventProjections = db.collection('event_projections');
  }

  async setupEventSourcingPattern() {
    // Event sourcing with change streams
    console.log('Setting up event sourcing pattern...');

    const aggregateCollections = [
      'user_aggregates',
      'order_aggregates', 
      'inventory_aggregates',
      'payment_aggregates'
    ];

    for (const collectionName of aggregateCollections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: { $in: ['insert', 'update', 'replace'] }
            }
          },
          {
            $addFields: {
              // Create event sourcing envelope
              eventEnvelope: {
                eventId: { $toString: '$_id' },
                eventType: '$operationType',
                aggregateId: '$documentKey._id',
                aggregateType: collectionName,
                eventVersion: { $ifNull: ['$fullDocument.version', 1] },
                eventData: '$fullDocument',
                eventMetadata: {
                  timestamp: '$clusterTime',
                  source: 'change-stream',
                  causationId: '$fullDocument.causationId',
                  correlationId: '$fullDocument.correlationId'
                }
              }
            }
          }
        ],
        {
          fullDocument: 'updateLookup',
          fullDocumentBeforeChange: 'whenAvailable'
        }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processEventSourcingEvent(changeEvent);
      });

      this.eventProcessors.set(`${collectionName}_eventsourcing`, changeStream);
    }
  }

  async processEventSourcingEvent(changeEvent) {
    const { eventEnvelope } = changeEvent;

    // Store event in event store
    await this.eventStore.insertOne({
      ...eventEnvelope,
      storedAt: new Date(),
      processedBy: [],
      projectionStatus: 'pending'
    });

    // Update read model projections
    await this.updateProjections(eventEnvelope);

    // Trigger sagas and process managers
    await this.triggerSagas(eventEnvelope);
  }

  async setupCQRSPattern() {
    // Command Query Responsibility Segregation with change streams
    console.log('Setting up CQRS pattern...');

    const commandCollections = ['commands', 'command_results'];

    for (const collectionName of commandCollections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: 'insert',
              'fullDocument.status': { $ne: 'processed' }
            }
          }
        ],
        { fullDocument: 'updateLookup' }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processCommand(changeEvent.fullDocument);
      });

      this.eventProcessors.set(`${collectionName}_cqrs`, changeStream);
    }
  }

  async setupSagaOrchestration() {
    // Saga pattern for distributed transaction coordination
    console.log('Setting up saga orchestration...');

    const sagaCollection = this.db.collection('sagas');

    const changeStream = sagaCollection.watch(
      [
        {
          $match: {
            $or: [
              { operationType: 'insert' },
              { 
                operationType: 'update',
                'updateDescription.updatedFields.status': { $exists: true }
              }
            ]
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    changeStream.on('change', async (changeEvent) => {
      await this.processSagaEvent(changeEvent);
    });

    this.eventProcessors.set('saga_orchestration', changeStream);
  }

  async processSagaEvent(changeEvent) {
    const saga = changeEvent.fullDocument;
    const { sagaId, status, currentStep, steps } = saga;

    console.log(`Processing saga ${sagaId}: ${status} at step ${currentStep}`);

    switch (status) {
      case 'started':
        await this.executeSagaStep(saga, 0);
        break;

      case 'step_completed':
        if (currentStep + 1 < steps.length) {
          await this.executeSagaStep(saga, currentStep + 1);
        } else {
          await this.completeSaga(sagaId);
        }
        break;

      case 'step_failed':
        await this.compensateSaga(saga, currentStep);
        break;

      case 'compensating':
        if (currentStep > 0) {
          await this.executeCompensation(saga, currentStep - 1);
        } else {
          await this.failSaga(sagaId);
        }
        break;
    }
  }

  async setupStreamProcessing() {
    // Stream processing with windowed aggregations
    console.log('Setting up stream processing...');

    const eventStream = this.db.collection('events');

    const changeStream = eventStream.watch(
      [
        {
          $match: {
            operationType: 'insert',
            'fullDocument.eventType': { $in: ['user_activity', 'transaction', 'system_event'] }
          }
        },
        {
          $addFields: {
            processingWindow: {
              $dateTrunc: {
                date: '$fullDocument.timestamp',
                unit: 'minute',
                binSize: 5 // 5-minute windows
              }
            }
          }
        }
      ],
      { fullDocument: 'updateLookup' }
    );

    let windowBuffer = new Map();

    changeStream.on('change', async (changeEvent) => {
      await this.processStreamEvent(changeEvent, windowBuffer);
    });

    // Process window aggregations every minute
    setInterval(async () => {
      await this.processWindowedAggregations(windowBuffer);
    }, 60000);

    this.eventProcessors.set('stream_processing', changeStream);
  }

  async processStreamEvent(changeEvent, windowBuffer) {
    const event = changeEvent.fullDocument;
    const window = changeEvent.processingWindow;
    const windowKey = window.toISOString();

    if (!windowBuffer.has(windowKey)) {
      windowBuffer.set(windowKey, {
        window: window,
        events: [],
        aggregations: {
          count: 0,
          userActivities: 0,
          transactions: 0,
          systemEvents: 0,
          totalValue: 0
        }
      });
    }

    const windowData = windowBuffer.get(windowKey);
    windowData.events.push(event);
    windowData.aggregations.count++;

    // Type-specific aggregations
    switch (event.eventType) {
      case 'user_activity':
        windowData.aggregations.userActivities++;
        break;
      case 'transaction':
        windowData.aggregations.transactions++;
        windowData.aggregations.totalValue += event.amount || 0;
        break;
      case 'system_event':
        windowData.aggregations.systemEvents++;
        break;
    }

    // Real-time alerting for anomalies
    if (windowData.aggregations.count > 1000) {
      await this.triggerVolumeAlert(windowKey, windowData);
    }
  }

  async setupMultiCollectionCoordination() {
    // Coordinate changes across multiple collections
    console.log('Setting up multi-collection coordination...');

    const coordinationConfig = [
      {
        collections: ['users', 'user_preferences', 'user_activities'],
        coordinator: 'userProfileCoordinator'
      },
      {
        collections: ['orders', 'order_items', 'payments', 'shipping'],
        coordinator: 'orderProcessingCoordinator' 
      },
      {
        collections: ['products', 'inventory', 'pricing', 'reviews'],
        coordinator: 'productManagementCoordinator'
      }
    ];

    for (const config of coordinationConfig) {
      await this.setupCollectionCoordinator(config);
    }
  }

  async setupCollectionCoordinator(config) {
    const { collections, coordinator } = config;

    for (const collectionName of collections) {
      const collection = this.db.collection(collectionName);

      const changeStream = collection.watch(
        [
          {
            $match: {
              operationType: { $in: ['insert', 'update', 'delete'] }
            }
          },
          {
            $addFields: {
              coordinationContext: {
                coordinator: coordinator,
                sourceCollection: collectionName,
                relatedCollections: collections.filter(c => c !== collectionName)
              }
            }
          }
        ],
        { fullDocument: 'updateLookup' }
      );

      changeStream.on('change', async (changeEvent) => {
        await this.processCoordinatedChange(changeEvent);
      });

      this.eventProcessors.set(`${collectionName}_${coordinator}`, changeStream);
    }
  }

  async processCoordinatedChange(changeEvent) {
    const { coordinationContext, fullDocument, operationType } = changeEvent;
    const { coordinator, sourceCollection, relatedCollections } = coordinationContext;

    console.log(`Coordinated change in ${sourceCollection} via ${coordinator}`);

    // Execute coordination logic based on coordinator type
    switch (coordinator) {
      case 'userProfileCoordinator':
        await this.coordinateUserProfileChanges(changeEvent);
        break;

      case 'orderProcessingCoordinator':
        await this.coordinateOrderProcessing(changeEvent);
        break;

      case 'productManagementCoordinator':
        await this.coordinateProductManagement(changeEvent);
        break;
    }
  }

  async coordinateUserProfileChanges(changeEvent) {
    const { fullDocument, operationType, ns } = changeEvent;
    const sourceCollection = ns.coll;

    if (sourceCollection === 'users' && operationType === 'update') {
      // User profile updated - sync preferences and activities
      await this.syncUserPreferences(fullDocument._id);
      await this.updateUserActivityContext(fullDocument._id);
    }

    if (sourceCollection === 'user_activities' && operationType === 'insert') {
      // New activity - update user profile analytics
      await this.updateUserAnalytics(fullDocument.userId, fullDocument);
    }
  }

  async setupChangeStreamHealthMonitoring() {
    // Health monitoring and metrics collection
    console.log('Setting up change stream health monitoring...');

    const healthMetrics = {
      totalStreams: 0,
      activeStreams: 0,
      eventsProcessed: 0,
      errorCount: 0,
      lastProcessedEvent: null,
      streamLatency: new Map()
    };

    // Monitor each change stream
    for (const [streamName, changeStream] of this.eventProcessors.entries()) {
      healthMetrics.totalStreams++;

      if (!changeStream.closed) {
        healthMetrics.activeStreams++;
      }

      // Monitor stream latency
      const originalEmit = changeStream.emit;
      changeStream.emit = function(event, ...args) {
        if (event === 'change') {
          const latency = Date.now() - args[0].clusterTime.getTime();
          healthMetrics.streamLatency.set(streamName, latency);
          healthMetrics.lastProcessedEvent = new Date();
          healthMetrics.eventsProcessed++;
        }
        return originalEmit.call(this, event, ...args);
      };

      // Monitor errors
      changeStream.on('error', (error) => {
        healthMetrics.errorCount++;
        console.error(`Stream ${streamName} error:`, error);
      });
    }

    // Periodic health reporting
    setInterval(() => {
      this.reportHealthMetrics(healthMetrics);
    }, 30000); // Every 30 seconds

    return healthMetrics;
  }

  reportHealthMetrics(metrics) {
    const avgLatency = Array.from(metrics.streamLatency.values())
      .reduce((sum, latency) => sum + latency, 0) / metrics.streamLatency.size || 0;

    console.log('Change Stream Health Report:', {
      totalStreams: metrics.totalStreams,
      activeStreams: metrics.activeStreams,
      eventsProcessed: metrics.eventsProcessed,
      errorCount: metrics.errorCount,
      averageLatency: Math.round(avgLatency) + 'ms',
      lastActivity: metrics.lastProcessedEvent
    });
  }

  async shutdown() {
    console.log('Shutting down advanced change stream patterns...');

    for (const [name, processor] of this.eventProcessors.entries()) {
      try {
        await processor.close();
        console.log(`Closed processor: ${name}`);
      } catch (error) {
        console.error(`Error closing processor ${name}:`, error);
      }
    }

    this.eventProcessors.clear();
  }
}

SQL-Style Change Stream Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Change Stream operations:

-- QueryLeaf change stream operations with SQL-familiar syntax

-- Create change stream watchers with SQL-style syntax
CREATE CHANGE_STREAM user_activity_watcher ON user_activities
WITH (
  operations = ['insert', 'update'],
  full_document = 'updateLookup',
  full_document_before_change = 'whenAvailable'
)
FILTER (
  activity_type IN ('login', 'purchase', 'subscription_change')
  OR status = 'completed'
);

-- Advanced change stream with aggregation pipeline
CREATE CHANGE_STREAM order_processing_watcher ON order_events
WITH (
  operations = ['insert'],
  full_document = 'updateLookup'
)
PIPELINE (
  FILTER (
    event_type IN ('order_created', 'payment_processed', 'order_shipped', 'order_delivered')
  ),
  ADD_FIELDS (
    order_stage = CASE 
      WHEN event_type = 'order_created' THEN 'pending'
      WHEN event_type = 'payment_processed' THEN 'confirmed'
      WHEN event_type = 'order_shipped' THEN 'in_transit'
      WHEN event_type = 'order_delivered' THEN 'completed'
      ELSE 'unknown'
    END,
    processing_priority = CASE
      WHEN event_type = 'payment_processed' THEN 1
      WHEN event_type = 'order_created' THEN 2
      ELSE 3
    END
  )
);

-- Database-level change stream monitoring
CREATE CHANGE_STREAM database_monitor ON DATABASE
WITH (
  operations = ['insert', 'update', 'delete'],
  full_document = 'updateLookup'
)
FILTER (
  -- Exclude system collections
  ns.coll NOT LIKE 'system.%'
  AND ns.coll NOT LIKE 'temp_%'
)
PIPELINE (
  ADD_FIELDS (
    event_id = CAST(_id AS VARCHAR),
    event_timestamp = cluster_time,
    database_name = ns.db,
    collection_name = ns.coll,
    event_data = CASE operation_type
      WHEN 'insert' THEN JSON_BUILD_OBJECT('operation', 'created', 'document', full_document)
      WHEN 'update' THEN JSON_BUILD_OBJECT(
        'operation', 'updated',
        'document_key', document_key,
        'updated_fields', update_description.updated_fields,
        'removed_fields', update_description.removed_fields
      )
      WHEN 'delete' THEN JSON_BUILD_OBJECT('operation', 'deleted', 'document_key', document_key)
      ELSE JSON_BUILD_OBJECT('operation', operation_type, 'document_key', document_key)
    END
  )
);

-- Event-driven reactive queries
WITH CHANGE_STREAM inventory_changes AS (
  SELECT 
    document_key._id as item_id,
    full_document.item_name,
    full_document.stock_level,
    full_document_before_change.stock_level as previous_stock_level,
    operation_type,
    cluster_time as event_time,

    -- Calculate stock change
    full_document.stock_level - COALESCE(full_document_before_change.stock_level, 0) as stock_change

  FROM CHANGE_STREAM ON inventory 
  WHERE operation_type IN ('insert', 'update')
    AND (full_document.stock_level != full_document_before_change.stock_level OR operation_type = 'insert')
),
stock_alerts AS (
  SELECT *,
    CASE 
      WHEN stock_level = 0 THEN 'OUT_OF_STOCK'
      WHEN stock_level <= 10 THEN 'LOW_STOCK' 
      WHEN stock_change > 0 AND previous_stock_level = 0 THEN 'RESTOCKED'
      ELSE 'NORMAL'
    END as alert_type,

    CASE
      WHEN stock_level = 0 THEN 'critical'
      WHEN stock_level <= 10 THEN 'warning'
      WHEN stock_change > 100 THEN 'info'
      ELSE 'normal'
    END as alert_severity

  FROM inventory_changes
)
SELECT 
  item_id,
  item_name,
  stock_level,
  previous_stock_level,
  stock_change,
  alert_type,
  alert_severity,
  event_time,

  -- Generate alert message
  CASE alert_type
    WHEN 'OUT_OF_STOCK' THEN CONCAT('Item ', item_name, ' is now out of stock')
    WHEN 'LOW_STOCK' THEN CONCAT('Item ', item_name, ' is running low (', stock_level, ' remaining)')
    WHEN 'RESTOCKED' THEN CONCAT('Item ', item_name, ' has been restocked (', stock_level, ' units)')
    ELSE CONCAT('Stock updated for ', item_name, ': ', stock_change, ' units')
  END as alert_message

FROM stock_alerts
WHERE alert_type != 'NORMAL'
ORDER BY alert_severity DESC, event_time DESC;

-- Real-time user activity aggregation
WITH CHANGE_STREAM user_events AS (
  SELECT 
    full_document.user_id,
    full_document.activity_type,
    full_document.session_id,
    full_document.timestamp,
    full_document.metadata,
    cluster_time as event_time

  FROM CHANGE_STREAM ON user_activities
  WHERE operation_type = 'insert'
    AND full_document.activity_type IN ('page_view', 'click', 'purchase', 'login')
),
session_aggregations AS (
  SELECT 
    user_id,
    session_id,
    TIME_WINDOW('5 minutes', event_time) as time_window,

    -- Activity counts
    COUNT(*) as total_activities,
    COUNT(*) FILTER (WHERE activity_type = 'page_view') as page_views,
    COUNT(*) FILTER (WHERE activity_type = 'click') as clicks, 
    COUNT(*) FILTER (WHERE activity_type = 'purchase') as purchases,

    -- Session metrics
    MIN(timestamp) as session_start,
    MAX(timestamp) as session_end,
    MAX(timestamp) - MIN(timestamp) as session_duration,

    -- Engagement scoring
    COUNT(DISTINCT metadata.page_url) as unique_pages_visited,
    AVG(EXTRACT(EPOCH FROM (LEAD(timestamp) OVER (ORDER BY timestamp) - timestamp))) as avg_time_between_activities

  FROM user_events
  GROUP BY user_id, session_id, TIME_WINDOW('5 minutes', event_time)
),
user_behavior_insights AS (
  SELECT *,
    -- Engagement level
    CASE 
      WHEN session_duration > INTERVAL '30 minutes' AND clicks > 20 THEN 'highly_engaged'
      WHEN session_duration > INTERVAL '10 minutes' AND clicks > 5 THEN 'engaged'
      WHEN session_duration > INTERVAL '2 minutes' THEN 'browsing'
      ELSE 'quick_visit'
    END as engagement_level,

    -- Conversion indicators
    purchases > 0 as converted_session,
    clicks / GREATEST(page_views, 1) as click_through_rate,

    -- Behavioral patterns
    CASE 
      WHEN unique_pages_visited > 10 THEN 'explorer'
      WHEN avg_time_between_activities > 60 THEN 'reader'
      WHEN clicks > page_views * 2 THEN 'active_clicker'
      ELSE 'standard'
    END as behavior_pattern

  FROM session_aggregations
)
SELECT 
  user_id,
  session_id,
  time_window,
  total_activities,
  page_views,
  clicks,
  purchases,
  session_duration,
  engagement_level,
  behavior_pattern,
  converted_session,
  ROUND(click_through_rate, 3) as ctr,

  -- Real-time recommendations
  CASE behavior_pattern
    WHEN 'explorer' THEN 'Show product recommendations based on browsed categories'
    WHEN 'reader' THEN 'Provide detailed product information and reviews'
    WHEN 'active_clicker' THEN 'Present clear call-to-action buttons and offers'
    ELSE 'Standard personalization approach'
  END as recommendation_strategy

FROM user_behavior_insights
WHERE engagement_level IN ('engaged', 'highly_engaged')
ORDER BY session_start DESC;

-- Event sourcing with change streams
CREATE EVENT_STORE aggregate_events AS
SELECT 
  CAST(cluster_time AS VARCHAR) as event_id,
  operation_type as event_type,
  document_key._id as aggregate_id,
  ns.coll as aggregate_type,
  COALESCE(full_document.version, 1) as event_version,
  full_document as event_data,

  -- Event metadata
  JSON_BUILD_OBJECT(
    'timestamp', cluster_time,
    'source', 'change-stream',
    'causation_id', full_document.causation_id,
    'correlation_id', full_document.correlation_id,
    'user_id', full_document.user_id
  ) as event_metadata

FROM CHANGE_STREAM ON DATABASE
WHERE operation_type IN ('insert', 'update', 'replace')
  AND ns.coll LIKE '%_aggregates'
ORDER BY cluster_time ASC;

-- CQRS read model projections
CREATE MATERIALIZED VIEW user_profile_projection AS
WITH user_events AS (
  SELECT *
  FROM aggregate_events
  WHERE aggregate_type = 'user_aggregates'
    AND event_type IN ('insert', 'update')
  ORDER BY event_version ASC
),
profile_changes AS (
  SELECT 
    aggregate_id as user_id,
    event_data.email,
    event_data.first_name,
    event_data.last_name,
    event_data.preferences,
    event_data.subscription_status,
    event_data.total_orders,
    event_data.lifetime_value,
    event_metadata.timestamp as last_updated,

    -- Calculate derived fields
    ROW_NUMBER() OVER (PARTITION BY aggregate_id ORDER BY event_version DESC) as rn

  FROM user_events
)
SELECT 
  user_id,
  email,
  CONCAT(first_name, ' ', last_name) as full_name,
  preferences,
  subscription_status,
  total_orders,
  lifetime_value,
  last_updated,

  -- User segments
  CASE 
    WHEN lifetime_value > 1000 THEN 'premium'
    WHEN total_orders > 10 THEN 'loyal'
    WHEN total_orders > 0 THEN 'customer'
    ELSE 'prospect'
  END as user_segment,

  -- Activity status
  CASE 
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '7 days' THEN 'active'
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'recent'
    WHEN last_updated >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'inactive'
    ELSE 'dormant'
  END as activity_status

FROM profile_changes
WHERE rn = 1; -- Latest version only

-- Saga orchestration monitoring
WITH CHANGE_STREAM saga_events AS (
  SELECT 
    full_document.saga_id,
    full_document.saga_type,
    full_document.status,
    full_document.current_step,
    full_document.steps,
    full_document.started_at,
    full_document.completed_at,
    cluster_time as event_time,
    operation_type

  FROM CHANGE_STREAM ON sagas
  WHERE operation_type IN ('insert', 'update')
),
saga_monitoring AS (
  SELECT 
    saga_id,
    saga_type,
    status,
    current_step,
    ARRAY_LENGTH(steps, 1) as total_steps,
    started_at,
    completed_at,
    event_time,

    -- Progress calculation
    CASE 
      WHEN status = 'completed' THEN 100.0
      WHEN status = 'failed' THEN 0.0
      WHEN total_steps > 0 THEN (current_step::numeric / total_steps) * 100.0
      ELSE 0.0
    END as progress_percentage,

    -- Duration tracking
    CASE 
      WHEN completed_at IS NOT NULL THEN completed_at - started_at
      ELSE CURRENT_TIMESTAMP - started_at
    END as duration,

    -- Status classification
    CASE status
      WHEN 'completed' THEN 'success'
      WHEN 'failed' THEN 'error'
      WHEN 'compensating' THEN 'warning'
      WHEN 'started' THEN 'in_progress'
      ELSE 'unknown'
    END as status_category

  FROM saga_events
),
saga_health AS (
  SELECT 
    saga_type,
    status_category,
    COUNT(*) as saga_count,
    AVG(progress_percentage) as avg_progress,
    AVG(EXTRACT(EPOCH FROM duration)) as avg_duration_seconds,

    -- Performance metrics
    COUNT(*) FILTER (WHERE status = 'completed') as success_count,
    COUNT(*) FILTER (WHERE status = 'failed') as failure_count,
    COUNT(*) FILTER (WHERE duration > INTERVAL '5 minutes') as slow_saga_count

  FROM saga_monitoring
  WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY saga_type, status_category
)
SELECT 
  saga_type,
  status_category,
  saga_count,
  ROUND(avg_progress, 1) as avg_progress_pct,
  ROUND(avg_duration_seconds, 2) as avg_duration_sec,
  success_count,
  failure_count,
  slow_saga_count,

  -- Health indicators
  CASE 
    WHEN failure_count > success_count THEN 'unhealthy'
    WHEN slow_saga_count > saga_count * 0.5 THEN 'degraded'
    ELSE 'healthy'
  END as health_status,

  -- Success rate
  CASE 
    WHEN (success_count + failure_count) > 0 
    THEN ROUND((success_count::numeric / (success_count + failure_count)) * 100, 1)
    ELSE 0.0
  END as success_rate_pct

FROM saga_health
ORDER BY saga_type, status_category;

-- Resume token management for fault tolerance
CREATE TABLE change_stream_resume_tokens (
  stream_name VARCHAR(100) PRIMARY KEY,
  resume_token DOCUMENT NOT NULL,
  last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  stream_config DOCUMENT,

  -- Health tracking
  last_event_time TIMESTAMP,
  error_count INTEGER DEFAULT 0,
  restart_count INTEGER DEFAULT 0
);

-- Monitoring and alerting for change streams
WITH stream_health AS (
  SELECT 
    stream_name,
    resume_token,
    last_updated,
    last_event_time,
    error_count,
    restart_count,

    -- Health calculation
    CURRENT_TIMESTAMP - last_event_time as time_since_last_event,
    CURRENT_TIMESTAMP - last_updated as time_since_update,

    CASE 
      WHEN last_event_time IS NULL THEN 'never_active'
      WHEN CURRENT_TIMESTAMP - last_event_time > INTERVAL '5 minutes' THEN 'stalled'
      WHEN error_count > 5 THEN 'error_prone'
      WHEN restart_count > 3 THEN 'unstable'
      ELSE 'healthy'
    END as health_status

  FROM change_stream_resume_tokens
)
SELECT 
  stream_name,
  health_status,
  EXTRACT(EPOCH FROM time_since_last_event) as seconds_since_last_event,
  error_count,
  restart_count,

  -- Alert conditions
  CASE health_status
    WHEN 'never_active' THEN 'Stream has never processed events - check configuration'
    WHEN 'stalled' THEN 'Stream has not processed events recently - investigate connectivity'
    WHEN 'error_prone' THEN 'High error rate - review error logs and handlers'
    WHEN 'unstable' THEN 'Frequent restarts - check resource limits and stability'
    ELSE 'Stream operating normally'
  END as alert_message,

  CASE health_status
    WHEN 'never_active' THEN 'critical'
    WHEN 'stalled' THEN 'warning'  
    WHEN 'error_prone' THEN 'warning'
    WHEN 'unstable' THEN 'info'
    ELSE 'normal'
  END as alert_severity

FROM stream_health
WHERE health_status != 'healthy'
ORDER BY 
  CASE health_status
    WHEN 'never_active' THEN 1
    WHEN 'stalled' THEN 2
    WHEN 'error_prone' THEN 3
    WHEN 'unstable' THEN 4
    ELSE 5
  END;

-- QueryLeaf provides comprehensive change stream capabilities:
-- 1. SQL-familiar change stream creation and management syntax
-- 2. Real-time event processing with filtering and transformation
-- 3. Event-driven architecture patterns (CQRS, Event Sourcing, Sagas)
-- 4. Advanced stream processing with windowed aggregations
-- 5. Fault tolerance with resume token management
-- 6. Health monitoring and alerting for change streams
-- 7. Integration with MongoDB's native change stream optimizations
-- 8. Reactive query patterns for real-time analytics
-- 9. Multi-collection coordination and event correlation
-- 10. Familiar SQL syntax for complex event-driven applications

Best Practices for Change Stream Implementation

Event-Driven Architecture Design

Essential patterns for building robust event-driven systems:

Event Schema Design: Create consistent event schemas with proper versioning and backward compatibility
Resume Token Management: Implement reliable resume token persistence for fault tolerance
Error Handling: Design comprehensive error handling with retry logic and dead letter queues
Ordering Guarantees: Understand MongoDB's ordering guarantees and design accordingly
Filtering Optimization: Use aggregation pipelines to filter events at the database level
Resource Management: Monitor memory usage and connection limits for change streams

Performance and Scalability

Optimize change streams for high-performance event processing:

Connection Pooling: Use appropriate connection pooling for change stream connections
Batch Processing: Process events in batches where possible to improve throughput
Parallel Processing: Design for parallel event processing while maintaining ordering
Resource Limits: Set appropriate limits on change stream cursors and connections
Monitoring: Implement comprehensive monitoring for stream health and performance
Graceful Degradation: Design fallback mechanisms for change stream failures

Conclusion

MongoDB Change Streams provide native event-driven architecture capabilities that eliminate the complexity and limitations of traditional polling and trigger-based approaches. The ability to react to data changes in real-time with ordered, resumable event streams makes building responsive, scalable applications both powerful and elegant.

Key Change Streams benefits include:

Real-Time Reactivity: Instant response to data changes without polling overhead
Ordered Event Processing: Guaranteed ordering within shards with resume token support
Scalable Architecture: Works seamlessly across replica sets and sharded clusters
Rich Filtering: Aggregation pipeline support for sophisticated event filtering and transformation
Fault Tolerance: Built-in resume capabilities and error handling for production reliability
Ecosystem Integration: Native integration with MongoDB's ACID transactions and tooling

Whether you're building microservices architectures, real-time dashboards, event sourcing systems, or any application requiring immediate response to data changes, MongoDB Change Streams with QueryLeaf's familiar SQL interface provides the foundation for modern event-driven applications.

QueryLeaf Integration: QueryLeaf seamlessly manages MongoDB Change Streams while providing SQL-familiar event processing syntax, change detection patterns, and reactive query capabilities. Advanced event-driven architecture patterns including CQRS, Event Sourcing, and Sagas are elegantly handled through familiar SQL constructs, making sophisticated reactive applications both powerful and accessible to SQL-oriented development teams.

The combination of native change stream capabilities with SQL-style event processing makes MongoDB an ideal platform for applications requiring both real-time responsiveness and familiar database interaction patterns, ensuring your event-driven solutions remain both effective and maintainable as they evolve and scale.

September 19, 2025
26 min read

MongoDB Capped Collections and Circular Buffers: High-Performance Logging and Event Storage with SQL-Style Data Management

High-performance applications generate massive volumes of log data, events, and operational metrics that require specialized storage patterns optimized for write-heavy workloads, automatic size management, and chronological data access. Traditional database approaches for logging and event storage struggle with write performance bottlenecks, complex rotation mechanisms, and inefficient space utilization when dealing with continuous data streams.

MongoDB Capped Collections provide purpose-built capabilities for circular buffer patterns, offering fixed-size collections with automatic document rotation, natural insertion-order preservation, and optimized write performance. Unlike traditional logging solutions that require complex partitioning schemes or external rotation tools, capped collections automatically manage storage limits while maintaining chronological access patterns essential for debugging, monitoring, and real-time analytics.

The Traditional Logging Storage Challenge

Conventional approaches to high-volume logging and event storage have significant limitations for modern applications:

-- Traditional relational logging approach - complex and performance-limited

-- PostgreSQL log storage with manual partitioning and rotation
CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    application_name VARCHAR(100) NOT NULL,
    service_name VARCHAR(100) NOT NULL,
    instance_id VARCHAR(100),
    log_level VARCHAR(20) NOT NULL,
    message TEXT NOT NULL,

    -- Structured log data
    request_id VARCHAR(100),
    user_id BIGINT,
    session_id VARCHAR(100),
    trace_id VARCHAR(100),
    span_id VARCHAR(100),

    -- Context information  
    source_file VARCHAR(255),
    source_line INTEGER,
    function_name VARCHAR(255),
    thread_id INTEGER,

    -- Metadata
    hostname VARCHAR(255),
    environment VARCHAR(50),
    version VARCHAR(50),

    -- Log data
    log_data JSONB,
    error_stack TEXT,

    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    -- Partitioning key
    partition_date DATE GENERATED ALWAYS AS (created_at::date) STORED

) PARTITION BY RANGE (partition_date);

-- Create monthly partitions (manual maintenance required)
CREATE TABLE application_logs_2024_01 PARTITION OF application_logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE application_logs_2024_02 PARTITION OF application_logs  
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE application_logs_2024_03 PARTITION OF application_logs
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- ... manual partition creation continues

-- Indexes for log queries (high overhead on writes)
CREATE INDEX idx_logs_app_service_time ON application_logs (application_name, service_name, created_at);
CREATE INDEX idx_logs_level_time ON application_logs (log_level, created_at);
CREATE INDEX idx_logs_request_id ON application_logs (request_id) WHERE request_id IS NOT NULL;
CREATE INDEX idx_logs_user_id_time ON application_logs (user_id, created_at) WHERE user_id IS NOT NULL;
CREATE INDEX idx_logs_trace_id ON application_logs (trace_id) WHERE trace_id IS NOT NULL;

-- Complex log rotation and cleanup procedure
CREATE OR REPLACE FUNCTION cleanup_old_log_partitions()
RETURNS void AS $$
DECLARE
    partition_name TEXT;
    cutoff_date DATE;
BEGIN
    -- Calculate cutoff date (e.g., 90 days retention)
    cutoff_date := CURRENT_DATE - INTERVAL '90 days';

    -- Find and drop old partitions
    FOR partition_name IN 
        SELECT schemaname||'.'||tablename 
        FROM pg_tables 
        WHERE tablename LIKE 'application_logs_____'
        AND tablename < 'application_logs_' || to_char(cutoff_date, 'YYYY_MM')
    LOOP
        EXECUTE 'DROP TABLE IF EXISTS ' || partition_name || ' CASCADE';
        RAISE NOTICE 'Dropped old partition: %', partition_name;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Schedule cleanup job (requires external scheduler)
-- SELECT cron.schedule('cleanup-logs', '0 2 * * 0', 'SELECT cleanup_old_log_partitions();');

-- Complex log analysis query with performance issues
WITH recent_logs AS (
    SELECT 
        application_name,
        service_name,
        log_level,
        message,
        request_id,
        user_id,
        trace_id,
        log_data,
        created_at,

        -- Row number for chronological ordering
        ROW_NUMBER() OVER (
            PARTITION BY application_name, service_name 
            ORDER BY created_at DESC
        ) as rn,

        -- Lag for time between log entries
        LAG(created_at) OVER (
            PARTITION BY application_name, service_name 
            ORDER BY created_at
        ) as prev_log_time

    FROM application_logs
    WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
      AND log_level IN ('ERROR', 'WARN', 'INFO')
),
error_analysis AS (
    SELECT 
        application_name,
        service_name,
        COUNT(*) as total_logs,
        COUNT(*) FILTER (WHERE log_level = 'ERROR') as error_count,
        COUNT(*) FILTER (WHERE log_level = 'WARN') as warning_count,
        COUNT(*) FILTER (WHERE log_level = 'INFO') as info_count,

        -- Error patterns
        array_agg(DISTINCT message) FILTER (WHERE log_level = 'ERROR') as error_messages,
        COUNT(DISTINCT request_id) as unique_requests,
        COUNT(DISTINCT user_id) as affected_users,

        -- Timing analysis
        AVG(EXTRACT(EPOCH FROM (created_at - prev_log_time))) as avg_log_interval,

        -- Recent errors for immediate attention
        array_agg(
            json_build_object(
                'message', message,
                'created_at', created_at,
                'trace_id', trace_id,
                'request_id', request_id
            ) ORDER BY created_at DESC
        ) FILTER (WHERE log_level = 'ERROR' AND rn <= 10) as recent_errors

    FROM recent_logs
    GROUP BY application_name, service_name
),
log_volume_trends AS (
    SELECT 
        application_name,
        service_name,
        DATE_TRUNC('minute', created_at) as minute_bucket,
        COUNT(*) as logs_per_minute,
        COUNT(*) FILTER (WHERE log_level = 'ERROR') as errors_per_minute
    FROM application_logs
    WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
    GROUP BY application_name, service_name, DATE_TRUNC('minute', created_at)
)
SELECT 
    ea.application_name,
    ea.service_name,
    ea.total_logs,
    ea.error_count,
    ea.warning_count,
    ea.info_count,
    ROUND((ea.error_count::numeric / ea.total_logs) * 100, 2) as error_rate_percent,
    ea.unique_requests,
    ea.affected_users,
    ROUND(ea.avg_log_interval::numeric, 3) as avg_seconds_between_logs,

    -- Volume trend analysis
    (
        SELECT AVG(logs_per_minute)
        FROM log_volume_trends lvt 
        WHERE lvt.application_name = ea.application_name 
          AND lvt.service_name = ea.service_name
    ) as avg_logs_per_minute,

    (
        SELECT MAX(logs_per_minute)
        FROM log_volume_trends lvt
        WHERE lvt.application_name = ea.application_name
          AND lvt.service_name = ea.service_name  
    ) as peak_logs_per_minute,

    -- Top error messages
    (
        SELECT string_agg(error_msg, '; ') 
        FROM unnest(ea.error_messages) as error_msg
        LIMIT 3
    ) as top_error_messages,

    ea.recent_errors

FROM error_analysis ea
ORDER BY ea.error_count DESC, ea.total_logs DESC;

-- Problems with traditional logging approach:
-- 1. Complex partition management and maintenance overhead
-- 2. Write performance degradation with increasing indexes
-- 3. Manual log rotation and cleanup procedures
-- 4. Storage space management challenges
-- 5. Query performance issues across multiple partitions
-- 6. Complex chronological ordering requirements
-- 7. High operational overhead for high-volume logging
-- 8. Scalability limitations with increasing log volumes
-- 9. Backup and restore complexity with partitioned tables
-- 10. Limited flexibility for varying log data structures

-- MySQL logging limitations (even more restrictive)
CREATE TABLE mysql_logs (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    app_name VARCHAR(100),
    level VARCHAR(20),
    message TEXT,
    log_data JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- MySQL partitioning limitations
    INDEX idx_time_level (created_at, level),
    INDEX idx_app_time (app_name, created_at)
) 
-- Basic range partitioning (limited functionality)
PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
    PARTITION p2024_q1 VALUES LESS THAN (UNIX_TIMESTAMP('2024-04-01')),
    PARTITION p2024_q2 VALUES LESS THAN (UNIX_TIMESTAMP('2024-07-01')),
    PARTITION p2024_q3 VALUES LESS THAN (UNIX_TIMESTAMP('2024-10-01')),
    PARTITION p2024_q4 VALUES LESS THAN (UNIX_TIMESTAMP('2025-01-01'))
);

-- Basic log query in MySQL (limited analytical capabilities)
SELECT 
    app_name,
    level,
    COUNT(*) as log_count,
    MAX(created_at) as latest_log
FROM mysql_logs
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
  AND level IN ('ERROR', 'WARN')
GROUP BY app_name, level
ORDER BY log_count DESC
LIMIT 20;

-- MySQL limitations:
-- - Limited JSON functionality compared to PostgreSQL
-- - Basic partitioning capabilities only  
-- - Poor performance with high-volume inserts
-- - Limited analytical query capabilities
-- - No advanced window functions
-- - Complex maintenance procedures
-- - Storage engine limitations for write-heavy workloads

MongoDB Capped Collections provide optimized circular buffer capabilities:

// MongoDB Capped Collections - purpose-built for high-performance logging
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('logging_platform');

// Create capped collections for different log types and performance requirements
const createOptimizedCappedCollections = async () => {
  try {
    // High-volume application logs - 1GB circular buffer
    await db.createCollection('application_logs', {
      capped: true,
      size: 1024 * 1024 * 1024, // 1GB maximum size
      max: 10000000 // Maximum 10 million documents (optional limit)
    });

    // Error logs - smaller, longer retention
    await db.createCollection('error_logs', {
      capped: true,
      size: 256 * 1024 * 1024, // 256MB maximum size
      max: 1000000 // Maximum 1 million error documents
    });

    // Access logs - high throughput, shorter retention
    await db.createCollection('access_logs', {
      capped: true,
      size: 2 * 1024 * 1024 * 1024, // 2GB maximum size
      // No max document limit for maximum throughput
    });

    // Performance metrics - structured time-series data
    await db.createCollection('performance_metrics', {
      capped: true,
      size: 512 * 1024 * 1024, // 512MB maximum size
      max: 5000000 // Maximum 5 million metric points
    });

    // Audit trail - compliance and security logs
    await db.createCollection('audit_logs', {
      capped: true,
      size: 128 * 1024 * 1024, // 128MB maximum size
      max: 500000 // Maximum 500k audit events
    });

    console.log('Capped collections created successfully');

    // Create indexes for common query patterns (minimal overhead)
    await createOptimalIndexes();

    return {
      applicationLogs: db.collection('application_logs'),
      errorLogs: db.collection('error_logs'),
      accessLogs: db.collection('access_logs'),
      performanceMetrics: db.collection('performance_metrics'),
      auditLogs: db.collection('audit_logs')
    };

  } catch (error) {
    console.error('Error creating capped collections:', error);
    throw error;
  }
};

async function createOptimalIndexes() {
  // Minimal indexes for capped collections to maintain write performance
  // Note: Capped collections maintain insertion order automatically

  // Application logs - service and level queries
  await db.collection('application_logs').createIndex({ 
    'service': 1, 
    'level': 1 
  });

  // Error logs - application and timestamp queries
  await db.collection('error_logs').createIndex({ 
    'application': 1, 
    'timestamp': -1 
  });

  // Access logs - endpoint performance analysis
  await db.collection('access_logs').createIndex({ 
    'endpoint': 1, 
    'status_code': 1 
  });

  // Performance metrics - metric type and timestamp
  await db.collection('performance_metrics').createIndex({ 
    'metric_type': 1, 
    'instance_id': 1 
  });

  // Audit logs - user and action queries
  await db.collection('audit_logs').createIndex({ 
    'user_id': 1, 
    'action': 1 
  });

  console.log('Optimal indexes created for capped collections');
}

// High-performance log ingestion with batch processing
const logIngestionSystem = {
  collections: null,
  buffers: new Map(),
  batchSizes: {
    application_logs: 1000,
    error_logs: 100,
    access_logs: 2000,
    performance_metrics: 500,
    audit_logs: 50
  },
  flushIntervals: new Map(),

  async initialize() {
    this.collections = await createOptimizedCappedCollections();

    // Start batch flush timers for each collection
    for (const [collectionName, batchSize] of Object.entries(this.batchSizes)) {
      this.buffers.set(collectionName, []);

      // Flush timer based on expected volume
      const flushInterval = collectionName === 'access_logs' ? 1000 : // 1 second
                           collectionName === 'application_logs' ? 2000 : // 2 seconds
                           5000; // 5 seconds for others

      const intervalId = setInterval(
        () => this.flushBuffer(collectionName), 
        flushInterval
      );

      this.flushIntervals.set(collectionName, intervalId);
    }

    console.log('Log ingestion system initialized');
  },

  async logApplicationEvent(logEntry) {
    // Structured application log entry
    const document = {
      timestamp: new Date(),
      application: logEntry.application || 'unknown',
      service: logEntry.service || 'unknown',
      instance: logEntry.instance || process.env.HOSTNAME || 'unknown',
      level: logEntry.level || 'INFO',
      message: logEntry.message,

      // Request context
      request: {
        id: logEntry.requestId,
        method: logEntry.method,
        endpoint: logEntry.endpoint,
        user_id: logEntry.userId,
        session_id: logEntry.sessionId,
        ip_address: logEntry.ipAddress
      },

      // Trace context
      trace: {
        trace_id: logEntry.traceId,
        span_id: logEntry.spanId,
        parent_span_id: logEntry.parentSpanId,
        flags: logEntry.traceFlags
      },

      // Source information
      source: {
        file: logEntry.sourceFile,
        line: logEntry.sourceLine,
        function: logEntry.functionName,
        thread: logEntry.threadId
      },

      // Environment context
      environment: {
        name: logEntry.environment || process.env.NODE_ENV || 'development',
        version: logEntry.version || process.env.APP_VERSION || '1.0.0',
        build: logEntry.build || process.env.BUILD_ID,
        commit: logEntry.commit || process.env.GIT_COMMIT
      },

      // Structured data
      data: logEntry.data || {},

      // Performance metrics
      metrics: {
        duration_ms: logEntry.duration,
        memory_mb: logEntry.memoryUsage,
        cpu_percent: logEntry.cpuUsage
      },

      // Error context (if applicable)
      error: logEntry.error ? {
        name: logEntry.error.name,
        message: logEntry.error.message,
        stack: logEntry.error.stack,
        code: logEntry.error.code,
        details: logEntry.error.details
      } : null
    };

    await this.bufferDocument('application_logs', document);
  },

  async logAccessEvent(accessEntry) {
    // HTTP access log optimized for high throughput
    const document = {
      timestamp: new Date(),

      // Request details
      method: accessEntry.method,
      endpoint: accessEntry.endpoint,
      path: accessEntry.path,
      query_string: accessEntry.queryString,

      // Response details
      status_code: accessEntry.statusCode,
      response_size: accessEntry.responseSize,
      content_type: accessEntry.contentType,

      // Timing information
      duration_ms: accessEntry.duration,
      queue_time_ms: accessEntry.queueTime,
      process_time_ms: accessEntry.processTime,

      // Client information
      client: {
        ip: accessEntry.clientIp,
        user_agent: accessEntry.userAgent,
        referer: accessEntry.referer,
        user_id: accessEntry.userId,
        session_id: accessEntry.sessionId
      },

      // Geographic data (if available)
      geo: accessEntry.geo ? {
        country: accessEntry.geo.country,
        region: accessEntry.geo.region,
        city: accessEntry.geo.city,
        coordinates: accessEntry.geo.coordinates
      } : null,

      // Application context
      application: accessEntry.application,
      service: accessEntry.service,
      instance: accessEntry.instance || process.env.HOSTNAME,
      version: accessEntry.version,

      // Cache information
      cache: {
        hit: accessEntry.cacheHit,
        key: accessEntry.cacheKey,
        ttl: accessEntry.cacheTTL
      },

      // Load balancing and routing
      routing: {
        backend: accessEntry.backend,
        upstream_time: accessEntry.upstreamTime,
        retry_count: accessEntry.retryCount
      }
    };

    await this.bufferDocument('access_logs', document);
  },

  async logPerformanceMetric(metricEntry) {
    // System and application performance metrics
    const document = {
      timestamp: new Date(),

      metric_type: metricEntry.type, // 'cpu', 'memory', 'disk', 'network', 'application'
      metric_name: metricEntry.name,
      value: metricEntry.value,
      unit: metricEntry.unit,

      // Instance information
      instance_id: metricEntry.instanceId || process.env.HOSTNAME,
      application: metricEntry.application,
      service: metricEntry.service,

      // Dimensional metadata
      dimensions: metricEntry.dimensions || {},

      // Aggregation information
      aggregation: {
        type: metricEntry.aggregationType, // 'gauge', 'counter', 'histogram', 'summary'
        interval_seconds: metricEntry.intervalSeconds,
        sample_count: metricEntry.sampleCount
      },

      // Statistical data (for histograms/summaries)
      statistics: metricEntry.statistics ? {
        min: metricEntry.statistics.min,
        max: metricEntry.statistics.max,
        mean: metricEntry.statistics.mean,
        median: metricEntry.statistics.median,
        p95: metricEntry.statistics.p95,
        p99: metricEntry.statistics.p99,
        std_dev: metricEntry.statistics.stdDev
      } : null,

      // Alerts and thresholds
      alerts: {
        warning_threshold: metricEntry.warningThreshold,
        critical_threshold: metricEntry.criticalThreshold,
        is_anomaly: metricEntry.isAnomaly,
        anomaly_score: metricEntry.anomalyScore
      }
    };

    await this.bufferDocument('performance_metrics', document);
  },

  async logAuditEvent(auditEntry) {
    // Security and compliance audit logging
    const document = {
      timestamp: new Date(),

      // Event classification
      event_type: auditEntry.eventType, // 'authentication', 'authorization', 'data_access', 'configuration'
      event_category: auditEntry.category, // 'security', 'compliance', 'operational'
      severity: auditEntry.severity || 'INFO',

      // Actor information
      actor: {
        user_id: auditEntry.userId,
        username: auditEntry.username,
        email: auditEntry.email,
        roles: auditEntry.roles || [],
        groups: auditEntry.groups || [],
        is_service_account: auditEntry.isServiceAccount || false,
        authentication_method: auditEntry.authMethod
      },

      // Target resource
      target: {
        resource_type: auditEntry.resourceType,
        resource_id: auditEntry.resourceId,
        resource_name: auditEntry.resourceName,
        owner: auditEntry.resourceOwner,
        classification: auditEntry.dataClassification
      },

      // Action details
      action: {
        type: auditEntry.action, // 'create', 'read', 'update', 'delete', 'login', 'logout'
        description: auditEntry.description,
        result: auditEntry.result, // 'success', 'failure', 'partial'
        reason: auditEntry.reason
      },

      // Request context
      request: {
        id: auditEntry.requestId,
        source_ip: auditEntry.sourceIp,
        user_agent: auditEntry.userAgent,
        session_id: auditEntry.sessionId,
        api_key: auditEntry.apiKey ? 'REDACTED' : null
      },

      // Data changes (for modification events)
      changes: auditEntry.changes ? {
        before: auditEntry.changes.before,
        after: auditEntry.changes.after,
        fields_changed: auditEntry.changes.fieldsChanged || []
      } : null,

      // Compliance and regulatory
      compliance: {
        regulation: auditEntry.regulation, // 'GDPR', 'SOX', 'HIPAA', 'PCI-DSS'
        retention_period: auditEntry.retentionPeriod,
        encryption_required: auditEntry.encryptionRequired || false
      },

      // Application context
      application: auditEntry.application,
      service: auditEntry.service,
      environment: auditEntry.environment
    };

    await this.bufferDocument('audit_logs', document);
  },

  async bufferDocument(collectionName, document) {
    const buffer = this.buffers.get(collectionName);
    if (!buffer) {
      console.error(`Unknown collection: ${collectionName}`);
      return;
    }

    buffer.push(document);

    // Flush buffer if it reaches batch size
    if (buffer.length >= this.batchSizes[collectionName]) {
      await this.flushBuffer(collectionName);
    }
  },

  async flushBuffer(collectionName) {
    const buffer = this.buffers.get(collectionName);
    if (!buffer || buffer.length === 0) {
      return;
    }

    // Move buffer contents to local array and clear buffer
    const documents = buffer.splice(0);

    try {
      const collection = this.collections[this.getCollectionProperty(collectionName)];
      if (!collection) {
        console.error(`Collection not found: ${collectionName}`);
        return;
      }

      // High-performance batch insert
      const result = await collection.insertMany(documents, {
        ordered: false, // Allow parallel inserts
        writeConcern: { w: 1, j: false } // Optimize for speed
      });

      if (result.insertedCount !== documents.length) {
        console.warn(`Partial insert: ${result.insertedCount}/${documents.length} documents inserted to ${collectionName}`);
      }

    } catch (error) {
      console.error(`Error flushing buffer for ${collectionName}:`, error);

      // Re-add documents to buffer for retry (optional)
      if (error.code !== 11000) { // Not a duplicate key error
        buffer.unshift(...documents);
      }
    }
  },

  getCollectionProperty(collectionName) {
    const mapping = {
      'application_logs': 'applicationLogs',
      'error_logs': 'errorLogs',
      'access_logs': 'accessLogs',
      'performance_metrics': 'performanceMetrics',
      'audit_logs': 'auditLogs'
    };
    return mapping[collectionName];
  },

  async shutdown() {
    console.log('Shutting down log ingestion system...');

    // Clear all flush intervals
    for (const intervalId of this.flushIntervals.values()) {
      clearInterval(intervalId);
    }

    // Flush all remaining buffers
    const flushPromises = [];
    for (const collectionName of this.buffers.keys()) {
      flushPromises.push(this.flushBuffer(collectionName));
    }

    await Promise.all(flushPromises);

    console.log('Log ingestion system shutdown complete');
  }
};

// Advanced log analysis and monitoring
const logAnalysisEngine = {
  collections: null,

  async initialize(collections) {
    this.collections = collections;
  },

  async analyzeRecentErrors(timeRangeMinutes = 60) {
    console.log(`Analyzing errors from last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const errorAnalysis = await this.collections.applicationLogs.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime },
          level: { $in: ['ERROR', 'FATAL'] }
        }
      },

      // Group by error patterns
      {
        $group: {
          _id: {
            application: '$application',
            service: '$service',
            errorMessage: {
              $substr: ['$message', 0, 100] // Truncate for grouping
            }
          },

          count: { $sum: 1 },
          firstOccurrence: { $min: '$timestamp' },
          lastOccurrence: { $max: '$timestamp' },
          affectedInstances: { $addToSet: '$instance' },
          affectedUsers: { $addToSet: '$request.user_id' },

          // Sample error details
          sampleErrors: {
            $push: {
              timestamp: '$timestamp',
              message: '$message',
              request_id: '$request.id',
              trace_id: '$trace.trace_id',
              stack: '$error.stack'
            }
          }
        }
      },

      // Calculate error characteristics
      {
        $addFields: {
          duration: {
            $divide: [
              { $subtract: ['$lastOccurrence', '$firstOccurrence'] },
              1000 // Convert to seconds
            ]
          },
          errorRate: {
            $divide: ['$count', timeRangeMinutes] // Errors per minute
          },
          instanceCount: { $size: '$affectedInstances' },
          userCount: { $size: '$affectedUsers' },

          // Take only recent sample errors
          recentSamples: { $slice: ['$sampleErrors', -5] }
        }
      },

      // Sort by error frequency and recency
      {
        $sort: {
          count: -1,
          lastOccurrence: -1
        }
      },

      {
        $limit: 50 // Top 50 error patterns
      },

      // Format for analysis output
      {
        $project: {
          application: '$_id.application',
          service: '$_id.service',
          errorPattern: '$_id.errorMessage',
          count: 1,
          errorRate: { $round: ['$errorRate', 2] },
          duration: { $round: ['$duration', 1] },
          firstOccurrence: 1,
          lastOccurrence: 1,
          instanceCount: 1,
          userCount: 1,
          affectedInstances: 1,
          recentSamples: 1,

          // Severity assessment
          severity: {
            $switch: {
              branches: [
                {
                  case: { $gt: ['$errorRate', 10] }, // > 10 errors/minute
                  then: 'CRITICAL'
                },
                {
                  case: { $gt: ['$errorRate', 5] }, // > 5 errors/minute
                  then: 'HIGH'
                },
                {
                  case: { $gt: ['$errorRate', 1] }, // > 1 error/minute
                  then: 'MEDIUM'
                }
              ],
              default: 'LOW'
            }
          }
        }
      }
    ]).toArray();

    console.log(`Found ${errorAnalysis.length} error patterns`);
    return errorAnalysis;
  },

  async analyzeAccessPatterns(timeRangeMinutes = 30) {
    console.log(`Analyzing access patterns from last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const accessAnalysis = await this.collections.accessLogs.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime }
        }
      },

      // Group by endpoint and status
      {
        $group: {
          _id: {
            endpoint: '$endpoint',
            method: '$method',
            statusClass: {
              $switch: {
                branches: [
                  { case: { $lt: ['$status_code', 300] }, then: '2xx' },
                  { case: { $lt: ['$status_code', 400] }, then: '3xx' },
                  { case: { $lt: ['$status_code', 500] }, then: '4xx' },
                  { case: { $gte: ['$status_code', 500] }, then: '5xx' }
                ],
                default: 'unknown'
              }
            }
          },

          requestCount: { $sum: 1 },
          avgDuration: { $avg: '$duration_ms' },
          minDuration: { $min: '$duration_ms' },
          maxDuration: { $max: '$duration_ms' },

          // Percentile approximations
          durations: { $push: '$duration_ms' },

          totalResponseSize: { $sum: '$response_size' },
          uniqueClients: { $addToSet: '$client.ip' },
          uniqueUsers: { $addToSet: '$client.user_id' },

          // Error details for non-2xx responses
          errorSamples: {
            $push: {
              $cond: [
                { $gte: ['$status_code', 400] },
                {
                  timestamp: '$timestamp',
                  status: '$status_code',
                  client_ip: '$client.ip',
                  user_id: '$client.user_id',
                  duration: '$duration_ms'
                },
                null
              ]
            }
          }
        }
      },

      // Calculate additional metrics
      {
        $addFields: {
          requestsPerMinute: { $divide: ['$requestCount', timeRangeMinutes] },
          avgResponseSize: { $divide: ['$totalResponseSize', '$requestCount'] },
          uniqueClientCount: { $size: '$uniqueClients' },
          uniqueUserCount: { $size: '$uniqueUsers' },

          // Filter out null error samples
          errorSamples: {
            $filter: {
              input: '$errorSamples',
              cond: { $ne: ['$$this', null] }
            }
          },

          // Approximate percentiles (simplified)
          p95Duration: {
            $let: {
              vars: {
                sortedDurations: {
                  $sortArray: {
                    input: '$durations',
                    sortBy: 1
                  }
                }
              },
              in: {
                $arrayElemAt: [
                  '$$sortedDurations',
                  { $floor: { $multiply: [{ $size: '$$sortedDurations' }, 0.95] } }
                ]
              }
            }
          }
        }
      },

      // Sort by request volume
      {
        $sort: {
          requestCount: -1
        }
      },

      {
        $limit: 100 // Top 100 endpoints
      },

      // Format output
      {
        $project: {
          endpoint: '$_id.endpoint',
          method: '$_id.method',
          statusClass: '$_id.statusClass',
          requestCount: 1,
          requestsPerMinute: { $round: ['$requestsPerMinute', 2] },
          avgDuration: { $round: ['$avgDuration', 1] },
          minDuration: 1,
          maxDuration: 1,
          p95Duration: { $round: ['$p95Duration', 1] },
          avgResponseSize: { $round: ['$avgResponseSize', 0] },
          uniqueClientCount: 1,
          uniqueUserCount: 1,
          errorSamples: { $slice: ['$errorSamples', 5] }, // Recent 5 errors

          // Performance assessment
          performanceStatus: {
            $switch: {
              branches: [
                {
                  case: { $gt: ['$avgDuration', 5000] }, // > 5 seconds
                  then: 'SLOW'
                },
                {
                  case: { $gt: ['$avgDuration', 2000] }, // > 2 seconds
                  then: 'WARNING'
                }
              ],
              default: 'NORMAL'
            }
          }
        }
      }
    ]).toArray();

    console.log(`Analyzed ${accessAnalysis.length} endpoint patterns`);
    return accessAnalysis;
  },

  async generatePerformanceReport(timeRangeMinutes = 60) {
    console.log(`Generating performance report for last ${timeRangeMinutes} minutes...`);

    const cutoffTime = new Date(Date.now() - timeRangeMinutes * 60 * 1000);

    const performanceReport = await this.collections.performanceMetrics.aggregate([
      {
        $match: {
          timestamp: { $gte: cutoffTime }
        }
      },

      // Group by metric type and instance
      {
        $group: {
          _id: {
            metricType: '$metric_type',
            metricName: '$metric_name',
            instanceId: '$instance_id'
          },

          sampleCount: { $sum: 1 },
          avgValue: { $avg: '$value' },
          minValue: { $min: '$value' },
          maxValue: { $max: '$value' },
          latestValue: { $last: '$value' },

          // Time series data for trending
          timeSeries: {
            $push: {
              timestamp: '$timestamp',
              value: '$value'
            }
          },

          // Alert information
          alertCount: {
            $sum: {
              $cond: [
                {
                  $or: [
                    { $gte: ['$value', '$alerts.critical_threshold'] },
                    { $gte: ['$value', '$alerts.warning_threshold'] }
                  ]
                },
                1,
                0
              ]
            }
          }
        }
      },

      // Calculate trend and status
      {
        $addFields: {
          // Simple trend calculation (comparing first and last values)
          trend: {
            $let: {
              vars: {
                firstValue: { $arrayElemAt: ['$timeSeries', 0] },
                lastValue: { $arrayElemAt: ['$timeSeries', -1] }
              },
              in: {
                $cond: [
                  { $gt: ['$$lastValue.value', '$$firstValue.value'] },
                  'INCREASING',
                  {
                    $cond: [
                      { $lt: ['$$lastValue.value', '$$firstValue.value'] },
                      'DECREASING',
                      'STABLE'
                    ]
                  }
                ]
              }
            }
          },

          // Alert status
          alertStatus: {
            $cond: [
              { $gt: ['$alertCount', 0] },
              'ALERTS_TRIGGERED',
              'NORMAL'
            ]
          }
        }
      },

      // Group by metric type for summary
      {
        $group: {
          _id: '$_id.metricType',

          metrics: {
            $push: {
              name: '$_id.metricName',
              instance: '$_id.instanceId',
              sampleCount: '$sampleCount',
              avgValue: '$avgValue',
              minValue: '$minValue',
              maxValue: '$maxValue',
              latestValue: '$latestValue',
              trend: '$trend',
              alertStatus: '$alertStatus',
              alertCount: '$alertCount'
            }
          },

          totalSamples: { $sum: '$sampleCount' },
          instanceCount: { $addToSet: '$_id.instanceId' },
          totalAlerts: { $sum: '$alertCount' }
        }
      },

      {
        $addFields: {
          instanceCount: { $size: '$instanceCount' }
        }
      },

      {
        $sort: { _id: 1 }
      }
    ]).toArray();

    console.log(`Performance report generated for ${performanceReport.length} metric types`);
    return performanceReport;
  },

  async getTailLogs(collectionName, limit = 100) {
    // Get most recent logs (natural order in capped collections)
    const collection = this.collections[this.getCollectionProperty(collectionName)];
    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Capped collections maintain insertion order, so we can use natural order
    const logs = await collection.find()
      .sort({ $natural: -1 }) // Reverse natural order (most recent first)
      .limit(limit)
      .toArray();

    return logs.reverse(); // Return in chronological order (oldest first)
  },

  getCollectionProperty(collectionName) {
    const mapping = {
      'application_logs': 'applicationLogs',
      'error_logs': 'errorLogs', 
      'access_logs': 'accessLogs',
      'performance_metrics': 'performanceMetrics',
      'audit_logs': 'auditLogs'
    };
    return mapping[collectionName];
  }
};

// Benefits of MongoDB Capped Collections:
// - Automatic size management with guaranteed space limits
// - Natural insertion order preservation without indexes
// - Optimized write performance for high-throughput logging
// - Circular buffer behavior with automatic old document removal
// - No fragmentation or maintenance overhead
// - Tailable cursors for real-time log streaming
// - Atomic document rotation without application logic
// - Consistent performance regardless of collection size
// - Integration with MongoDB ecosystem and tools
// - Built-in clustering and replication support

module.exports = {
  createOptimizedCappedCollections,
  logIngestionSystem,
  logAnalysisEngine
};

Understanding MongoDB Capped Collections Architecture

Advanced Capped Collection Management and Patterns

Implement sophisticated capped collection strategies for different logging scenarios:

// Advanced capped collection management system
class CappedCollectionManager {
  constructor(db, options = {}) {
    this.db = db;
    this.options = {
      // Default configurations
      defaultSize: 100 * 1024 * 1024, // 100MB
      retentionPeriods: {
        application_logs: 7 * 24 * 60 * 60 * 1000, // 7 days
        error_logs: 30 * 24 * 60 * 60 * 1000, // 30 days  
        access_logs: 24 * 60 * 60 * 1000, // 24 hours
        audit_logs: 365 * 24 * 60 * 60 * 1000 // 1 year
      },
      ...options
    };

    this.collections = new Map();
    this.tails = new Map();
    this.statistics = new Map();
  }

  async createCappedCollectionHierarchy() {
    // Create hierarchical capped collections for different log levels and retention

    // Critical logs - smallest size, longest retention
    await this.createTieredCollection('critical_logs', {
      size: 50 * 1024 * 1024, // 50MB
      max: 100000,
      retention: 'critical'
    });

    // Error logs - medium size and retention  
    await this.createTieredCollection('error_logs', {
      size: 200 * 1024 * 1024, // 200MB
      max: 500000,
      retention: 'error'
    });

    // Warning logs - larger size, medium retention
    await this.createTieredCollection('warning_logs', {
      size: 300 * 1024 * 1024, // 300MB  
      max: 1000000,
      retention: 'warning'
    });

    // Info logs - large size, shorter retention
    await this.createTieredCollection('info_logs', {
      size: 500 * 1024 * 1024, // 500MB
      max: 2000000, 
      retention: 'info'
    });

    // Debug logs - largest size, shortest retention
    await this.createTieredCollection('debug_logs', {
      size: 1024 * 1024 * 1024, // 1GB
      max: 5000000,
      retention: 'debug'
    });

    // Specialized collections
    await this.createSpecializedCollections();

    console.log('Capped collection hierarchy created');
  }

  async createTieredCollection(name, config) {
    try {
      const collection = await this.db.createCollection(name, {
        capped: true,
        size: config.size,
        max: config.max
      });

      this.collections.set(name, collection);

      // Initialize statistics tracking
      this.statistics.set(name, {
        documentsInserted: 0,
        totalSize: 0,
        lastInsert: null,
        insertRate: 0,
        retentionType: config.retention
      });

      console.log(`Created capped collection: ${name} (${config.size} bytes, max ${config.max} docs)`);

    } catch (error) {
      if (error.code === 48) { // Collection already exists
        console.log(`Capped collection ${name} already exists`);
        const collection = this.db.collection(name);
        this.collections.set(name, collection);
      } else {
        throw error;
      }
    }
  }

  async createSpecializedCollections() {
    // Real-time metrics collection
    await this.createTieredCollection('realtime_metrics', {
      size: 100 * 1024 * 1024, // 100MB
      max: 1000000,
      retention: 'realtime'
    });

    // Security events collection
    await this.createTieredCollection('security_events', {
      size: 50 * 1024 * 1024, // 50MB
      max: 200000,
      retention: 'security'
    });

    // Business events collection  
    await this.createTieredCollection('business_events', {
      size: 200 * 1024 * 1024, // 200MB
      max: 1000000,
      retention: 'business'
    });

    // System health collection
    await this.createTieredCollection('system_health', {
      size: 150 * 1024 * 1024, // 150MB
      max: 500000,
      retention: 'system'
    });

    // Create minimal indexes for specialized queries
    await this.createSpecializedIndexes();
  }

  async createSpecializedIndexes() {
    // Minimal indexes to maintain write performance

    // Real-time metrics - by type and timestamp
    await this.collections.get('realtime_metrics').createIndex({
      metric_type: 1,
      timestamp: -1
    });

    // Security events - by severity and event type
    await this.collections.get('security_events').createIndex({
      severity: 1,
      event_type: 1
    });

    // Business events - by event category
    await this.collections.get('business_events').createIndex({
      category: 1,
      user_id: 1
    });

    // System health - by component and status
    await this.collections.get('system_health').createIndex({
      component: 1,
      status: 1
    });
  }

  async insertWithRouting(logLevel, document) {
    // Route documents to appropriate capped collection based on level
    const routingMap = {
      FATAL: 'critical_logs',
      ERROR: 'error_logs', 
      WARN: 'warning_logs',
      INFO: 'info_logs',
      DEBUG: 'debug_logs',
      TRACE: 'debug_logs'
    };

    const collectionName = routingMap[logLevel] || 'info_logs';
    const collection = this.collections.get(collectionName);

    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Add routing metadata
    const enrichedDocument = {
      ...document,
      _routed_to: collectionName,
      _inserted_at: new Date()
    };

    try {
      const result = await collection.insertOne(enrichedDocument);

      // Update statistics
      this.updateInsertionStatistics(collectionName, enrichedDocument);

      return result;
    } catch (error) {
      console.error(`Error inserting to ${collectionName}:`, error);
      throw error;
    }
  }

  updateInsertionStatistics(collectionName, document) {
    const stats = this.statistics.get(collectionName);
    if (!stats) return;

    stats.documentsInserted++;
    stats.totalSize += this.estimateDocumentSize(document);
    stats.lastInsert = new Date();

    // Calculate insertion rate (documents per second)
    if (stats.documentsInserted > 1) {
      const timeSpan = stats.lastInsert - stats.firstInsert || 1;
      stats.insertRate = (stats.documentsInserted / (timeSpan / 1000)).toFixed(2);
    } else {
      stats.firstInsert = stats.lastInsert;
    }
  }

  estimateDocumentSize(document) {
    // Rough estimation of document size in bytes
    return JSON.stringify(document).length * 2; // UTF-8 approximation
  }

  async setupTailableStreams() {
    // Set up tailable cursors for real-time log streaming
    console.log('Setting up tailable cursors for real-time streaming...');

    for (const [collectionName, collection] of this.collections.entries()) {
      const tail = collection.find().addCursorFlag('tailable', true)
                             .addCursorFlag('awaitData', true);

      this.tails.set(collectionName, tail);

      // Start async processing of tailable cursor
      this.processTailableStream(collectionName, tail);
    }
  }

  async processTailableStream(collectionName, cursor) {
    console.log(`Starting tailable stream for: ${collectionName}`);

    try {
      for await (const document of cursor) {
        // Process real-time log document
        await this.processRealtimeLog(collectionName, document);
      }
    } catch (error) {
      console.error(`Tailable stream error for ${collectionName}:`, error);

      // Attempt to restart the stream
      setTimeout(() => {
        this.restartTailableStream(collectionName);
      }, 5000);
    }
  }

  async processRealtimeLog(collectionName, document) {
    // Real-time processing of log entries
    const stats = this.statistics.get(collectionName);

    // Update real-time statistics
    if (stats) {
      stats.documentsInserted++;
      stats.lastInsert = new Date();
    }

    // Trigger alerts for critical conditions
    if (collectionName === 'critical_logs' || collectionName === 'error_logs') {
      await this.checkForAlertConditions(document);
    }

    // Real-time analytics
    if (collectionName === 'realtime_metrics') {
      await this.updateRealtimeMetrics(document);
    }

    // Security monitoring
    if (collectionName === 'security_events') {
      await this.analyzeSecurityEvent(document);
    }

    // Emit to external systems (WebSocket, message queues, etc.)
    this.emitRealtimeEvent(collectionName, document);
  }

  async checkForAlertConditions(document) {
    // Implement alert logic for critical conditions
    const alertConditions = [
      // High error rate
      document.level === 'ERROR' && document.error_count > 10,

      // Security incidents
      document.category === 'security' && document.severity === 'high',

      // System failures
      document.component === 'database' && document.status === 'down',

      // Performance degradation
      document.metric_type === 'response_time' && document.value > 10000
    ];

    if (alertConditions.some(condition => condition)) {
      await this.triggerAlert({
        type: 'critical_condition',
        document: document,
        timestamp: new Date()
      });
    }
  }

  async triggerAlert(alert) {
    console.log('ALERT TRIGGERED:', JSON.stringify(alert, null, 2));

    // Store alert in dedicated collection
    const alertsCollection = this.db.collection('alerts');
    await alertsCollection.insertOne({
      ...alert,
      _id: new ObjectId(),
      acknowledged: false,
      created_at: new Date()
    });

    // Send external notifications (email, Slack, PagerDuty, etc.)
    // Implementation depends on notification system
  }

  emitRealtimeEvent(collectionName, document) {
    // Emit to WebSocket connections, message queues, etc.
    console.log(`Real-time event: ${collectionName}`, {
      id: document._id,
      timestamp: document._inserted_at || document.timestamp,
      level: document.level,
      message: document.message?.substring(0, 100) + '...'
    });
  }

  async getCollectionStatistics(collectionName) {
    const collection = this.collections.get(collectionName);
    if (!collection) {
      throw new Error(`Collection not found: ${collectionName}`);
    }

    // Get MongoDB collection statistics
    const stats = await this.db.runCommand({ collStats: collectionName });
    const customStats = this.statistics.get(collectionName);

    return {
      // MongoDB statistics
      size: stats.size,
      count: stats.count,
      avgObjSize: stats.avgObjSize,
      storageSize: stats.storageSize,
      capped: stats.capped,
      max: stats.max,
      maxSize: stats.maxSize,

      // Custom statistics
      insertRate: customStats?.insertRate || 0,
      lastInsert: customStats?.lastInsert,
      retentionType: customStats?.retentionType,

      // Calculated metrics
      utilizationPercent: ((stats.size / stats.maxSize) * 100).toFixed(2),
      documentsPerMB: Math.round(stats.count / (stats.size / 1024 / 1024)),

      // Health assessment
      healthStatus: this.assessCollectionHealth(stats, customStats)
    };
  }

  assessCollectionHealth(mongoStats, customStats) {
    const utilizationPercent = (mongoStats.size / mongoStats.maxSize) * 100;
    const timeSinceLastInsert = customStats?.lastInsert ? 
      Date.now() - customStats.lastInsert.getTime() : Infinity;

    if (utilizationPercent > 95) {
      return 'NEAR_CAPACITY';
    } else if (timeSinceLastInsert > 300000) { // 5 minutes
      return 'INACTIVE';
    } else if (customStats?.insertRate > 1000) {
      return 'HIGH_VOLUME';
    } else {
      return 'HEALTHY';
    }
  }

  async performMaintenance() {
    console.log('Performing capped collection maintenance...');

    const maintenanceReport = {
      timestamp: new Date(),
      collections: {},
      recommendations: []
    };

    for (const collectionName of this.collections.keys()) {
      const stats = await this.getCollectionStatistics(collectionName);
      maintenanceReport.collections[collectionName] = stats;

      // Generate recommendations based on statistics
      if (stats.healthStatus === 'NEAR_CAPACITY') {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'SIZE_WARNING',
          message: `Collection ${collectionName} is at ${stats.utilizationPercent}% capacity`
        });
      }

      if (stats.healthStatus === 'INACTIVE') {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'INACTIVE_WARNING',
          message: `Collection ${collectionName} has not received data recently`
        });
      }

      if (stats.insertRate > 1000) {
        maintenanceReport.recommendations.push({
          collection: collectionName,
          type: 'HIGH_VOLUME',
          message: `Collection ${collectionName} has high insertion rate: ${stats.insertRate}/sec`
        });
      }
    }

    console.log('Maintenance report generated:', maintenanceReport);
    return maintenanceReport;
  }

  async shutdown() {
    console.log('Shutting down capped collection manager...');

    // Close all tailable cursors
    for (const [collectionName, cursor] of this.tails.entries()) {
      try {
        await cursor.close();
        console.log(`Closed tailable cursor for: ${collectionName}`);
      } catch (error) {
        console.error(`Error closing cursor for ${collectionName}:`, error);
      }
    }

    this.tails.clear();
    this.collections.clear();
    this.statistics.clear();

    console.log('Capped collection manager shutdown complete');
  }
}

// Real-time log aggregation and analysis
class RealtimeLogAggregator {
  constructor(cappedManager) {
    this.cappedManager = cappedManager;
    this.aggregationWindows = new Map();
    this.alertThresholds = {
      errorRate: 0.05, // 5% error rate
      responseTime: 5000, // 5 seconds
      memoryUsage: 0.85, // 85% memory usage
      cpuUsage: 0.90 // 90% CPU usage
    };
  }

  async startRealtimeAggregation() {
    console.log('Starting real-time log aggregation...');

    // Set up sliding window aggregations
    this.startSlidingWindow('error_rate', 300000); // 5-minute window
    this.startSlidingWindow('response_time', 60000); // 1-minute window
    this.startSlidingWindow('throughput', 60000); // 1-minute window
    this.startSlidingWindow('resource_usage', 120000); // 2-minute window

    console.log('Real-time aggregation started');
  }

  startSlidingWindow(metricType, windowSizeMs) {
    const windowData = {
      data: [],
      windowSize: windowSizeMs,
      lastCleanup: Date.now()
    };

    this.aggregationWindows.set(metricType, windowData);

    // Start cleanup interval
    setInterval(() => {
      this.cleanupWindow(metricType);
    }, windowSizeMs / 10); // Cleanup every 1/10th of window size
  }

  cleanupWindow(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window) return;

    const cutoffTime = Date.now() - window.windowSize;
    window.data = window.data.filter(entry => entry.timestamp > cutoffTime);
    window.lastCleanup = Date.now();
  }

  addDataPoint(metricType, value, metadata = {}) {
    const window = this.aggregationWindows.get(metricType);
    if (!window) return;

    window.data.push({
      timestamp: Date.now(),
      value: value,
      metadata: metadata
    });

    // Check for alerts
    this.checkAggregationAlerts(metricType);
  }

  checkAggregationAlerts(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window || window.data.length === 0) return;

    const recentData = window.data.slice(-10); // Last 10 data points
    const avgValue = recentData.reduce((sum, point) => sum + point.value, 0) / recentData.length;

    let alertTriggered = false;
    let alertMessage = '';

    switch (metricType) {
      case 'error_rate':
        if (avgValue > this.alertThresholds.errorRate) {
          alertTriggered = true;
          alertMessage = `High error rate: ${(avgValue * 100).toFixed(2)}%`;
        }
        break;

      case 'response_time':
        if (avgValue > this.alertThresholds.responseTime) {
          alertTriggered = true;
          alertMessage = `High response time: ${avgValue.toFixed(0)}ms`;
        }
        break;

      case 'resource_usage':
        const memoryAlert = recentData.some(p => p.metadata.memory > this.alertThresholds.memoryUsage);
        const cpuAlert = recentData.some(p => p.metadata.cpu > this.alertThresholds.cpuUsage);

        if (memoryAlert || cpuAlert) {
          alertTriggered = true;
          alertMessage = `High resource usage: Memory ${memoryAlert ? 'HIGH' : 'OK'}, CPU ${cpuAlert ? 'HIGH' : 'OK'}`;
        }
        break;
    }

    if (alertTriggered) {
      this.cappedManager.triggerAlert({
        type: 'aggregation_alert',
        metricType: metricType,
        message: alertMessage,
        value: avgValue,
        threshold: this.alertThresholds[metricType] || 'N/A',
        recentData: recentData.slice(-3) // Last 3 data points
      });
    }
  }

  getWindowSummary(metricType) {
    const window = this.aggregationWindows.get(metricType);
    if (!window || window.data.length === 0) {
      return { metricType, dataPoints: 0, summary: null };
    }

    const values = window.data.map(point => point.value);
    const sortedValues = [...values].sort((a, b) => a - b);

    return {
      metricType: metricType,
      dataPoints: window.data.length,
      windowSizeMs: window.windowSize,
      summary: {
        min: Math.min(...values),
        max: Math.max(...values),
        avg: values.reduce((sum, val) => sum + val, 0) / values.length,
        median: sortedValues[Math.floor(sortedValues.length / 2)],
        p95: sortedValues[Math.floor(sortedValues.length * 0.95)],
        p99: sortedValues[Math.floor(sortedValues.length * 0.99)]
      },
      trend: this.calculateTrend(window.data),
      lastUpdate: window.data[window.data.length - 1].timestamp
    };
  }

  calculateTrend(dataPoints) {
    if (dataPoints.length < 2) return 'INSUFFICIENT_DATA';

    const firstHalf = dataPoints.slice(0, Math.floor(dataPoints.length / 2));
    const secondHalf = dataPoints.slice(Math.floor(dataPoints.length / 2));

    const firstHalfAvg = firstHalf.reduce((sum, p) => sum + p.value, 0) / firstHalf.length;
    const secondHalfAvg = secondHalf.reduce((sum, p) => sum + p.value, 0) / secondHalf.length;

    const change = (secondHalfAvg - firstHalfAvg) / firstHalfAvg;

    if (Math.abs(change) < 0.05) return 'STABLE'; // Less than 5% change
    return change > 0 ? 'INCREASING' : 'DECREASING';
  }

  getAllWindowSummaries() {
    const summaries = {};
    for (const metricType of this.aggregationWindows.keys()) {
      summaries[metricType] = this.getWindowSummary(metricType);
    }
    return summaries;
  }
}

SQL-Style Capped Collection Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Capped Collection management and querying:

-- QueryLeaf capped collection operations with SQL-familiar syntax

-- Create capped collections with size and document limits
CREATE CAPPED COLLECTION application_logs 
WITH (
  size = '1GB',
  max_documents = 10000000,
  auto_rotate = true
);

CREATE CAPPED COLLECTION error_logs 
WITH (
  size = '256MB', 
  max_documents = 1000000
);

CREATE CAPPED COLLECTION access_logs
WITH (
  size = '2GB'
  -- No document limit for maximum throughput
);

-- High-performance log insertion
INSERT INTO application_logs 
VALUES (
  CURRENT_TIMESTAMP,
  'user-service',
  'payment-processor', 
  'prod-instance-01',
  'ERROR',
  'Payment processing failed for transaction tx_12345',

  -- Structured request context
  ROW(
    'req_98765',
    'POST',
    '/api/payments/process',
    'user_54321',
    'sess_abcdef',
    '192.168.1.100'
  ) AS request_context,

  -- Trace information
  ROW(
    'trace_xyz789',
    'span_456',
    'span_123',
    1
  ) AS trace_info,

  -- Error details
  ROW(
    'PaymentValidationError',
    'Invalid payment method: expired_card',
    'PaymentProcessor.validateCard() line 245',
    'PM001'
  ) AS error_details,

  -- Additional data
  JSON_BUILD_OBJECT(
    'transaction_id', 'tx_12345',
    'user_id', 'user_54321', 
    'payment_amount', 299.99,
    'payment_method', 'card_****1234',
    'merchant_id', 'merchant_789'
  ) AS log_data
);

-- Real-time log tailing (most recent entries first)
SELECT 
  timestamp,
  service,
  level,
  message,
  request_context.request_id,
  request_context.user_id,
  trace_info.trace_id,
  error_details.error_code,
  log_data
FROM application_logs
ORDER BY $natural DESC  -- Natural order in capped collections
LIMIT 100;

-- Log analysis with time-based aggregation
WITH recent_logs AS (
  SELECT 
    service,
    level,
    timestamp,
    message,
    request_context.user_id,
    error_details.error_code,

    -- Time bucketing for analysis
    DATE_TRUNC('minute', timestamp) as minute_bucket,
    DATE_TRUNC('hour', timestamp) as hour_bucket
  FROM application_logs
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '4 hours'
),

error_summary AS (
  SELECT 
    service,
    hour_bucket,
    level,
    COUNT(*) as log_count,
    COUNT(DISTINCT request_context.user_id) as affected_users,
    COUNT(DISTINCT error_details.error_code) as unique_errors,

    -- Error patterns
    mode() WITHIN GROUP (ORDER BY error_details.error_code) as most_common_error,
    array_agg(DISTINCT error_details.error_code) as error_codes,

    -- Sample messages for investigation
    array_agg(
      json_build_object(
        'timestamp', timestamp,
        'message', SUBSTRING(message, 1, 100),
        'user_id', request_context.user_id,
        'error_code', error_details.error_code
      ) ORDER BY timestamp DESC
    )[1:5] as recent_samples

  FROM recent_logs
  WHERE level IN ('ERROR', 'FATAL')
  GROUP BY service, hour_bucket, level
),

service_health AS (
  SELECT 
    service,
    hour_bucket,

    -- Overall metrics
    SUM(log_count) as total_logs,
    SUM(log_count) FILTER (WHERE level = 'ERROR') as error_count,
    SUM(log_count) FILTER (WHERE level = 'WARN') as warning_count,
    SUM(affected_users) as total_affected_users,

    -- Error rate calculation
    CASE 
      WHEN SUM(log_count) > 0 THEN 
        (SUM(log_count) FILTER (WHERE level = 'ERROR')::numeric / SUM(log_count)) * 100
      ELSE 0
    END as error_rate_percent,

    -- Service status assessment
    CASE 
      WHEN SUM(log_count) FILTER (WHERE level = 'ERROR') > 100 THEN 'CRITICAL'
      WHEN (SUM(log_count) FILTER (WHERE level = 'ERROR')::numeric / NULLIF(SUM(log_count), 0)) > 0.05 THEN 'DEGRADED'
      WHEN SUM(log_count) FILTER (WHERE level = 'WARN') > 50 THEN 'WARNING'
      ELSE 'HEALTHY'
    END as service_status

  FROM error_summary
  GROUP BY service, hour_bucket
)

SELECT 
  sh.service,
  sh.hour_bucket,
  sh.total_logs,
  sh.error_count,
  sh.warning_count,
  ROUND(sh.error_rate_percent, 2) as error_rate_pct,
  sh.total_affected_users,
  sh.service_status,

  -- Top error details
  es.most_common_error,
  es.unique_errors,
  es.error_codes,
  es.recent_samples,

  -- Trend analysis
  LAG(sh.error_count, 1) OVER (
    PARTITION BY sh.service 
    ORDER BY sh.hour_bucket
  ) as prev_hour_errors,

  sh.error_count - LAG(sh.error_count, 1) OVER (
    PARTITION BY sh.service 
    ORDER BY sh.hour_bucket
  ) as error_count_change

FROM service_health sh
LEFT JOIN error_summary es ON (
  sh.service = es.service AND 
  sh.hour_bucket = es.hour_bucket AND 
  es.level = 'ERROR'
)
WHERE sh.service_status != 'HEALTHY'
ORDER BY sh.service_status DESC, sh.error_rate_percent DESC, sh.hour_bucket DESC;

-- Access log analysis for performance monitoring
WITH access_metrics AS (
  SELECT 
    endpoint,
    method,
    DATE_TRUNC('minute', timestamp) as minute_bucket,

    -- Request metrics
    COUNT(*) as request_count,
    AVG(duration_ms) as avg_duration,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_ms) as median_duration,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_duration,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99_duration,
    MIN(duration_ms) as min_duration,
    MAX(duration_ms) as max_duration,

    -- Status code distribution
    COUNT(*) FILTER (WHERE status_code < 300) as success_count,
    COUNT(*) FILTER (WHERE status_code >= 300 AND status_code < 400) as redirect_count,
    COUNT(*) FILTER (WHERE status_code >= 400 AND status_code < 500) as client_error_count,
    COUNT(*) FILTER (WHERE status_code >= 500) as server_error_count,

    -- Data transfer metrics
    AVG(response_size) as avg_response_size,
    SUM(response_size) as total_response_size,

    -- Client metrics
    COUNT(DISTINCT client.ip) as unique_clients,
    COUNT(DISTINCT client.user_id) as unique_users

  FROM access_logs
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
  GROUP BY endpoint, method, minute_bucket
),

performance_analysis AS (
  SELECT 
    endpoint,
    method,

    -- Aggregated performance metrics
    SUM(request_count) as total_requests,
    AVG(avg_duration) as overall_avg_duration,
    MAX(p95_duration) as max_p95_duration,
    MAX(p99_duration) as max_p99_duration,

    -- Error rates
    (SUM(client_error_count + server_error_count)::numeric / SUM(request_count)) * 100 as error_rate_percent,
    SUM(server_error_count) as total_server_errors,

    -- Throughput metrics
    AVG(request_count) as avg_requests_per_minute,
    MAX(request_count) as peak_requests_per_minute,

    -- Data transfer
    AVG(avg_response_size) as avg_response_size,
    SUM(total_response_size) / (1024 * 1024) as total_mb_transferred,

    -- Client diversity
    AVG(unique_clients) as avg_unique_clients,
    AVG(unique_users) as avg_unique_users,

    -- Performance assessment
    CASE 
      WHEN AVG(avg_duration) > 5000 THEN 'SLOW'
      WHEN AVG(avg_duration) > 2000 THEN 'DEGRADED' 
      WHEN MAX(p95_duration) > 10000 THEN 'INCONSISTENT'
      ELSE 'NORMAL'
    END as performance_status,

    -- Time series data for trending
    array_agg(
      json_build_object(
        'minute', minute_bucket,
        'requests', request_count,
        'avg_duration', avg_duration,
        'p95_duration', p95_duration,
        'error_rate', (client_error_count + server_error_count)::numeric / request_count * 100
      ) ORDER BY minute_bucket
    ) as time_series_data

  FROM access_metrics
  GROUP BY endpoint, method
),

endpoint_ranking AS (
  SELECT *,
    ROW_NUMBER() OVER (ORDER BY total_requests DESC) as request_rank,
    ROW_NUMBER() OVER (ORDER BY error_rate_percent DESC) as error_rank,
    ROW_NUMBER() OVER (ORDER BY overall_avg_duration DESC) as duration_rank
  FROM performance_analysis
)

SELECT 
  endpoint,
  method,
  total_requests,
  ROUND(overall_avg_duration, 1) as avg_duration_ms,
  ROUND(max_p95_duration, 1) as max_p95_ms,
  ROUND(max_p99_duration, 1) as max_p99_ms,
  ROUND(error_rate_percent, 2) as error_rate_pct,
  total_server_errors,
  ROUND(avg_requests_per_minute, 1) as avg_rpm,
  peak_requests_per_minute as peak_rpm,
  ROUND(total_mb_transferred, 1) as total_mb,
  performance_status,

  -- Rankings
  request_rank,
  error_rank, 
  duration_rank,

  -- Alerts and recommendations
  CASE 
    WHEN performance_status = 'SLOW' THEN 'Optimize endpoint performance - average response time exceeds 5 seconds'
    WHEN performance_status = 'DEGRADED' THEN 'Monitor endpoint performance - response times elevated'
    WHEN performance_status = 'INCONSISTENT' THEN 'Investigate performance spikes - P95 latency exceeds 10 seconds'
    WHEN error_rate_percent > 5 THEN 'High error rate detected - investigate client and server errors'
    WHEN total_server_errors > 100 THEN 'Significant server errors detected - check application health'
    ELSE 'Performance within normal parameters'
  END as recommendation,

  time_series_data

FROM endpoint_ranking
WHERE (
  performance_status != 'NORMAL' OR 
  error_rate_percent > 1 OR 
  request_rank <= 20
)
ORDER BY 
  CASE performance_status
    WHEN 'SLOW' THEN 1
    WHEN 'DEGRADED' THEN 2
    WHEN 'INCONSISTENT' THEN 3
    ELSE 4
  END,
  error_rate_percent DESC,
  total_requests DESC;

-- Real-time metrics aggregation from capped collections
CREATE VIEW real_time_metrics AS
WITH metric_windows AS (
  SELECT 
    metric_type,
    metric_name,
    instance_id,

    -- Current values
    LAST_VALUE(value ORDER BY timestamp) as current_value,
    FIRST_VALUE(value ORDER BY timestamp) as first_value,

    -- Statistical aggregations
    AVG(value) as avg_value,
    MIN(value) as min_value,
    MAX(value) as max_value,
    STDDEV_POP(value) as stddev_value,
    COUNT(*) as sample_count,

    -- Trend calculation
    CASE 
      WHEN COUNT(*) >= 2 THEN
        (LAST_VALUE(value ORDER BY timestamp) - FIRST_VALUE(value ORDER BY timestamp)) / 
        NULLIF(FIRST_VALUE(value ORDER BY timestamp), 0) * 100
      ELSE 0
    END as trend_percent,

    -- Alert thresholds
    MAX(alerts.warning_threshold) as warning_threshold,
    MAX(alerts.critical_threshold) as critical_threshold,

    -- Time range
    MIN(timestamp) as window_start,
    MAX(timestamp) as window_end

  FROM performance_metrics
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
  GROUP BY metric_type, metric_name, instance_id
)

SELECT 
  metric_type,
  metric_name,
  instance_id,
  current_value,
  ROUND(avg_value::numeric, 2) as avg_value,
  min_value,
  max_value,
  ROUND(stddev_value::numeric, 2) as stddev,
  sample_count,
  ROUND(trend_percent::numeric, 1) as trend_pct,

  -- Alert status
  CASE 
    WHEN critical_threshold IS NOT NULL AND current_value >= critical_threshold THEN 'CRITICAL'
    WHEN warning_threshold IS NOT NULL AND current_value >= warning_threshold THEN 'WARNING'
    ELSE 'NORMAL'
  END as alert_status,

  warning_threshold,
  critical_threshold,
  window_start,
  window_end,

  -- Performance assessment
  CASE metric_type
    WHEN 'cpu_percent' THEN 
      CASE WHEN current_value > 90 THEN 'HIGH' 
           WHEN current_value > 70 THEN 'ELEVATED'
           ELSE 'NORMAL' END
    WHEN 'memory_percent' THEN
      CASE WHEN current_value > 85 THEN 'HIGH'
           WHEN current_value > 70 THEN 'ELEVATED' 
           ELSE 'NORMAL' END
    WHEN 'response_time_ms' THEN
      CASE WHEN current_value > 5000 THEN 'SLOW'
           WHEN current_value > 2000 THEN 'ELEVATED'
           ELSE 'NORMAL' END
    ELSE 'NORMAL'
  END as performance_status

FROM metric_windows
ORDER BY 
  CASE alert_status
    WHEN 'CRITICAL' THEN 1
    WHEN 'WARNING' THEN 2
    ELSE 3
  END,
  metric_type,
  metric_name;

-- Capped collection maintenance and monitoring
SELECT 
  collection_name,
  is_capped,
  max_size_bytes / (1024 * 1024) as max_size_mb,
  current_size_bytes / (1024 * 1024) as current_size_mb,
  document_count,
  max_documents,

  -- Utilization metrics
  ROUND((current_size_bytes::numeric / max_size_bytes) * 100, 1) as size_utilization_pct,
  ROUND((document_count::numeric / NULLIF(max_documents, 0)) * 100, 1) as document_utilization_pct,

  -- Health assessment
  CASE 
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.95 THEN 'NEAR_CAPACITY'
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.80 THEN 'HIGH_UTILIZATION'
    WHEN document_count = 0 THEN 'EMPTY'
    ELSE 'HEALTHY'
  END as health_status,

  -- Performance metrics
  avg_document_size_bytes,
  ROUND(avg_document_size_bytes / 1024.0, 1) as avg_document_size_kb,

  -- Recommendations
  CASE 
    WHEN (current_size_bytes::numeric / max_size_bytes) > 0.95 THEN 
      'Consider increasing collection size or reducing retention period'
    WHEN document_count = 0 THEN 
      'Collection is empty - verify data ingestion is working'
    WHEN avg_document_size_bytes > 16384 THEN 
      'Large average document size - consider data optimization'
    ELSE 'Collection operating within normal parameters'
  END as recommendation

FROM CAPPED_COLLECTION_STATS()
WHERE is_capped = true
ORDER BY size_utilization_pct DESC;

-- QueryLeaf provides comprehensive capped collection capabilities:
-- 1. SQL-familiar capped collection creation and management
-- 2. High-performance log insertion with structured data support
-- 3. Real-time log tailing and streaming with natural ordering
-- 4. Advanced log analysis with time-based aggregations
-- 5. Access pattern analysis for performance monitoring
-- 6. Real-time metrics aggregation and alerting
-- 7. Capped collection health monitoring and maintenance
-- 8. Integration with MongoDB's circular buffer optimizations
-- 9. Automatic size management without manual intervention
-- 10. Familiar SQL patterns for log analysis and troubleshooting

Best Practices for Capped Collection Implementation

Design Guidelines

Essential practices for optimal capped collection configuration:

Size Planning: Calculate appropriate collection sizes based on expected data volume and retention requirements
Index Strategy: Use minimal indexes to maintain write performance while supporting essential queries
Document Structure: Design documents for optimal compression and query performance
Retention Alignment: Align capped collection sizes with business retention and compliance requirements
Monitoring Setup: Implement continuous monitoring of collection utilization and performance
Alert Configuration: Set up alerts for capacity utilization and performance degradation

Performance and Scalability

Optimize capped collections for high-throughput logging scenarios:

Write Performance: Minimize indexes and use batch insertion for maximum throughput
Tailable Cursors: Leverage tailable cursors for real-time log streaming and processing
Collection Sizing: Balance collection size with query performance and storage efficiency
Replica Set Configuration: Optimize replica set settings for write-heavy workloads
Hardware Considerations: Use fast storage and adequate memory for optimal performance
Network Optimization: Configure network settings for high-volume log ingestion

Conclusion

MongoDB Capped Collections provide purpose-built capabilities for high-performance logging and circular buffer patterns that eliminate the complexity and overhead of traditional database approaches while delivering consistent performance and automatic space management. The natural ordering preservation and optimized write characteristics make capped collections ideal for log processing, event storage, and real-time data applications.

Key Capped Collection benefits include:

Automatic Size Management: Fixed-size collections with automatic document rotation
Write-Optimized Performance: Optimized for high-throughput, sequential write operations
Natural Ordering: Insertion order preservation without additional indexing overhead
Circular Buffer Behavior: Automatic old document removal when size limits are reached
Real-Time Streaming: Tailable cursor support for live log streaming and processing
Operational Simplicity: No manual maintenance or complex rotation procedures required

Whether you're building logging systems, event processors, real-time analytics platforms, or any application requiring circular buffer patterns, MongoDB Capped Collections with QueryLeaf's familiar SQL interface provides the foundation for high-performance data storage. This combination enables you to implement sophisticated logging capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Capped Collection operations while providing SQL-familiar collection creation, log analysis, and real-time querying syntax. Advanced circular buffer management, performance monitoring, and maintenance operations are seamlessly handled through familiar SQL patterns, making high-performance logging both powerful and accessible.

The integration of native capped collection capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both high-performance logging and familiar database interaction patterns, ensuring your logging solutions remain both effective and maintainable as they scale and evolve.

September 18, 2025
25 min read

MongoDB Geospatial Queries and Location-Based Services: SQL-Style Spatial Operations for Modern Applications

Location-aware applications have become fundamental to modern software experiences - from ride-sharing platforms and delivery services to social networks and retail applications. These applications require sophisticated spatial data processing capabilities including proximity searches, route optimization, geofencing, and real-time location tracking that traditional relational databases struggle to handle efficiently.

MongoDB provides comprehensive geospatial functionality with support for 2D and 3D coordinates, multiple coordinate reference systems, and advanced spatial operations. Unlike traditional databases that require complex extensions for spatial data, MongoDB natively supports geospatial indexes, queries, and aggregation operations that can handle billions of location data points with sub-second query performance.

The Traditional Spatial Data Challenge

Relational databases face significant limitations when handling geospatial data and location-based queries:

-- Traditional PostgreSQL/PostGIS approach - complex setup and limited performance
-- Location-based application with spatial data

CREATE EXTENSION IF NOT EXISTS postgis;

-- Store locations with geometry data
CREATE TABLE locations (
    location_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    category VARCHAR(100),
    address TEXT,
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(100),

    -- PostGIS geometry column (complex setup required)
    coordinates GEOMETRY(POINT, 4326), -- WGS84 coordinate system

    -- Additional spatial data
    service_area GEOMETRY(POLYGON, 4326), -- Service coverage area
    delivery_zones GEOMETRY(MULTIPOLYGON, 4326), -- Multiple delivery zones

    -- Business data
    rating DECIMAL(3,2),
    total_reviews INTEGER DEFAULT 0,
    is_active BOOLEAN DEFAULT true,
    hours_of_operation JSONB,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create spatial indexes (requires PostGIS extension)
CREATE INDEX idx_locations_coordinates ON locations USING GIST (coordinates);
CREATE INDEX idx_locations_service_area ON locations USING GIST (service_area);

-- Store user locations and activities
CREATE TABLE user_locations (
    user_location_id SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    coordinates GEOMETRY(POINT, 4326),
    accuracy_meters DECIMAL(8,2),
    recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    activity_type VARCHAR(50), -- 'check-in', 'delivery', 'movement'
    device_info JSONB
);

CREATE INDEX idx_user_locations_coordinates ON user_locations USING GIST (coordinates);
CREATE INDEX idx_user_locations_user_time ON user_locations (user_id, recorded_at);

-- Complex proximity search query
WITH nearby_locations AS (
    SELECT 
        l.location_id,
        l.name,
        l.category,
        l.rating,

        -- Distance calculation in meters
        ST_Distance(
            l.coordinates,
            ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326) -- San Francisco coordinates
        ) as distance_meters,

        -- Check if point is within service area
        ST_Contains(
            l.service_area,
            ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)
        ) as is_in_service_area,

        -- Convert coordinates back to lat/lng for application
        ST_Y(l.coordinates) as latitude,
        ST_X(l.coordinates) as longitude

    FROM locations l
    WHERE 
        l.is_active = true
        AND ST_DWithin(
            l.coordinates,
            ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326),
            5000 -- 5km radius in meters
        )
),
location_analytics AS (
    -- Add user activity data for locations
    SELECT 
        nl.*,
        COUNT(DISTINCT ul.user_id) as unique_visitors_last_30_days,
        COUNT(ul.user_location_id) as total_activities_last_30_days,
        AVG(ul.accuracy_meters) as avg_location_accuracy
    FROM nearby_locations nl
    LEFT JOIN user_locations ul ON ST_DWithin(
        ST_SetSRID(ST_MakePoint(nl.longitude, nl.latitude), 4326),
        ul.coordinates,
        100 -- Within 100 meters of location
    )
    AND ul.recorded_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY nl.location_id, nl.name, nl.category, nl.rating, 
             nl.distance_meters, nl.is_in_service_area, 
             nl.latitude, nl.longitude
)
SELECT 
    location_id,
    name,
    category,
    rating,
    ROUND(distance_meters::numeric, 0) as distance_meters,
    is_in_service_area,
    latitude,
    longitude,
    unique_visitors_last_30_days,
    total_activities_last_30_days,
    ROUND(avg_location_accuracy::numeric, 1) as avg_accuracy_meters,

    -- Relevance scoring based on distance, rating, and activity
    (
        (1000 - LEAST(distance_meters, 1000)) / 1000 * 0.4 + -- Distance factor (40%)
        (rating / 5.0) * 0.3 + -- Rating factor (30%)
        (LEAST(unique_visitors_last_30_days, 50) / 50.0) * 0.3 -- Activity factor (30%)
    ) as relevance_score

FROM location_analytics
ORDER BY relevance_score DESC, distance_meters ASC
LIMIT 20;

-- Problems with traditional spatial approach:
-- 1. Complex PostGIS extension setup and maintenance
-- 2. Requires specialized spatial database knowledge
-- 3. Limited coordinate system support without additional configuration
-- 4. Performance degrades with large datasets and complex queries
-- 5. Difficult integration with application object models
-- 6. Complex geometry data types and manipulation functions
-- 7. Limited aggregation capabilities for spatial analytics
-- 8. Challenging horizontal scaling for global applications
-- 9. Memory-intensive spatial operations
-- 10. Complex backup and restore procedures for spatial data

-- MySQL spatial limitations (even more restrictive):
CREATE TABLE locations_mysql (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    -- MySQL spatial support limited and less capable
    coordinates POINT NOT NULL,
    SPATIAL INDEX(coordinates)
);

-- Basic proximity query in MySQL (limited functionality)
SELECT 
    id, name,
    ST_Distance_Sphere(
        coordinates, 
        POINT(-122.4194, 37.7749)
    ) as distance_meters
FROM locations_mysql
WHERE ST_Distance_Sphere(
    coordinates, 
    POINT(-122.4194, 37.7749)
) < 5000
ORDER BY distance_meters
LIMIT 10;

-- MySQL limitations:
-- - Limited spatial functions compared to PostGIS
-- - Poor performance with large spatial datasets
-- - No advanced spatial analytics capabilities
-- - Limited coordinate system support
-- - Basic geometry types only
-- - No spatial aggregation functions
-- - Difficult to implement complex spatial business logic

MongoDB provides comprehensive geospatial capabilities with simple, intuitive syntax:

// MongoDB native geospatial support - powerful and intuitive
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('location_services');

// MongoDB geospatial document structure - native and flexible
const createLocationServiceDataModel = async () => {
  // Create locations collection with rich geospatial data
  const locations = db.collection('locations');

  // Example location document with geospatial data
  const locationDocument = {
    _id: new ObjectId(),

    // Basic business information
    name: "Blue Bottle Coffee - Ferry Building",
    category: "cafe",
    subcategory: "specialty_coffee",
    chain: "Blue Bottle Coffee",

    // Address information
    address: {
      street: "1 Ferry Building",
      unit: "Shop 7",
      city: "San Francisco",
      state: "CA",
      country: "USA",
      postalCode: "94111",
      formattedAddress: "1 Ferry Building, Shop 7, San Francisco, CA 94111"
    },

    // Primary location - GeoJSON Point format
    location: {
      type: "Point",
      coordinates: [-122.3937, 37.7955] // [longitude, latitude] - NOTE: MongoDB uses [lng, lat]
    },

    // Service area - GeoJSON Polygon format
    serviceArea: {
      type: "Polygon",
      coordinates: [[
        [-122.4050, 37.7850], // Southwest corner
        [-122.3850, 37.7850], // Southeast corner  
        [-122.3850, 37.8050], // Northeast corner
        [-122.4050, 37.8050], // Northwest corner
        [-122.4050, 37.7850]  // Close polygon
      ]]
    },

    // Multiple delivery zones - GeoJSON MultiPolygon
    deliveryZones: {
      type: "MultiPolygon", 
      coordinates: [
        [[ // First delivery zone
          [-122.4000, 37.7900],
          [-122.3900, 37.7900],
          [-122.3900, 37.8000],
          [-122.4000, 37.8000],
          [-122.4000, 37.7900]
        ]],
        [[ // Second delivery zone
          [-122.4100, 37.7800],
          [-122.3950, 37.7800],
          [-122.3950, 37.7900],
          [-122.4100, 37.7900],
          [-122.4100, 37.7800]
        ]]
      ]
    },

    // Business information
    business: {
      rating: 4.6,
      totalReviews: 1247,
      priceRange: "$$",
      phoneNumber: "+1-415-555-0123",
      website: "https://bluebottlecoffee.com",
      isActive: true,
      isChain: true,

      // Hours of operation with geospatial considerations
      hours: {
        monday: { open: "06:00", close: "19:00", timezone: "America/Los_Angeles" },
        tuesday: { open: "06:00", close: "19:00", timezone: "America/Los_Angeles" },
        wednesday: { open: "06:00", close: "19:00", timezone: "America/Los_Angeles" },
        thursday: { open: "06:00", close: "19:00", timezone: "America/Los_Angeles" },
        friday: { open: "06:00", close: "20:00", timezone: "America/Los_Angeles" },
        saturday: { open: "07:00", close: "20:00", timezone: "America/Los_Angeles" },
        sunday: { open: "07:00", close: "19:00", timezone: "America/Los_Angeles" }
      },

      // Services and amenities
      amenities: ["wifi", "outdoor_seating", "takeout", "delivery", "mobile_payment"],
      specialties: ["single_origin", "cold_brew", "espresso", "pour_over"]
    },

    // Geospatial metadata
    geoMetadata: {
      coordinateSystem: "WGS84",
      accuracyMeters: 5,
      elevationMeters: 15,
      dataSource: "GPS_verified",
      lastVerified: new Date("2024-09-01"),

      // Nearby landmarks for context
      nearbyLandmarks: [
        {
          name: "Ferry Building Marketplace",
          distance: 50,
          bearing: "north"
        },
        {
          name: "Embarcadero BART Station", 
          distance: 200,
          bearing: "west"
        }
      ]
    },

    // Analytics and performance data
    analytics: {
      monthlyVisitors: 12500,
      averageVisitDuration: 25, // minutes
      peakHours: ["08:00-09:00", "12:00-13:00", "15:00-16:00"],
      popularDays: ["monday", "tuesday", "wednesday", "friday"],

      // Location-specific metrics
      locationMetrics: {
        averageWalkingTime: 3.5, // minutes from nearest transit
        parkingAvailability: "limited",
        accessibilityRating: 4.2,
        noiseLevel: "moderate",
        crowdLevel: "busy"
      }
    },

    // SEO and discovery
    searchTerms: [
      "coffee shop ferry building", 
      "blue bottle san francisco",
      "specialty coffee embarcadero",
      "third wave coffee downtown sf"
    ],

    tags: ["coffee", "cafe", "specialty", "artisan", "downtown", "waterfront"],

    createdAt: new Date("2024-01-15"),
    updatedAt: new Date("2024-09-14")
  };

  // Insert the location document
  await locations.insertOne(locationDocument);

  // Create geospatial index - 2dsphere for spherical geometry (Earth)
  await locations.createIndex({ location: "2dsphere" });
  await locations.createIndex({ serviceArea: "2dsphere" });
  await locations.createIndex({ deliveryZones: "2dsphere" });

  // Additional indexes for common queries
  await locations.createIndex({ category: 1, "business.rating": -1 });
  await locations.createIndex({ "business.isActive": 1, "location": "2dsphere" });
  await locations.createIndex({ tags: 1, "location": "2dsphere" });

  console.log("Location document and indexes created successfully");
  return locations;
};

// Advanced geospatial queries and operations
const performGeospatialOperations = async () => {
  const locations = db.collection('locations');

  // 1. Proximity Search - Find nearby locations
  console.log("=== Proximity Search ===");
  const userLocation = [-122.4194, 37.7749]; // San Francisco coordinates [lng, lat]

  const nearbyLocations = await locations.find({
    location: {
      $near: {
        $geometry: {
          type: "Point",
          coordinates: userLocation
        },
        $maxDistance: 5000, // 5km in meters
        $minDistance: 0
      }
    },
    "business.isActive": true
  }).limit(10).toArray();

  console.log(`Found ${nearbyLocations.length} locations within 5km`);

  // 2. Geo Within - Find locations within a specific area
  console.log("\n=== Geo Within Search ===");
  const searchPolygon = {
    type: "Polygon", 
    coordinates: [[
      [-122.4270, 37.7609], // Southwest corner
      [-122.3968, 37.7609], // Southeast corner
      [-122.3968, 37.7908], // Northeast corner  
      [-122.4270, 37.7908], // Northwest corner
      [-122.4270, 37.7609]  // Close polygon
    ]]
  };

  const locationsInArea = await locations.find({
    location: {
      $geoWithin: {
        $geometry: searchPolygon
      }
    },
    category: "restaurant"
  }).toArray();

  console.log(`Found ${locationsInArea.length} restaurants in specified area`);

  // 3. Geospatial Aggregation - Complex analytics
  console.log("\n=== Geospatial Analytics ===");
  const geospatialAnalytics = await locations.aggregate([
    // Match active locations
    {
      $match: {
        "business.isActive": true,
        location: {
          $geoWithin: {
            $centerSphere: [userLocation, 10 / 3963.2] // 10 miles radius
          }
        }
      }
    },

    // Calculate distance from user location
    {
      $addFields: {
        distanceFromUser: {
          $divide: [
            {
              $sqrt: {
                $add: [
                  {
                    $pow: [
                      { $subtract: [{ $arrayElemAt: ["$location.coordinates", 0] }, userLocation[0]] },
                      2
                    ]
                  },
                  {
                    $pow: [
                      { $subtract: [{ $arrayElemAt: ["$location.coordinates", 1] }, userLocation[1]] },
                      2
                    ]
                  }
                ]
              }
            },
            0.000009 // Approximate degrees to meters conversion
          ]
        }
      }
    },

    // Group by category and analyze
    {
      $group: {
        _id: "$category",
        totalLocations: { $sum: 1 },
        averageRating: { $avg: "$business.rating" },
        averageDistance: { $avg: "$distanceFromUser" },
        closestLocation: {
          $min: {
            name: "$name",
            distance: "$distanceFromUser",
            coordinates: "$location.coordinates"
          }
        },

        // Collect all locations in category
        locations: {
          $push: {
            name: "$name",
            rating: "$business.rating",
            distance: "$distanceFromUser",
            coordinates: "$location.coordinates"
          }
        },

        // Rating distribution
        highRatedCount: {
          $sum: { $cond: [{ $gte: ["$business.rating", 4.5] }, 1, 0] }
        },
        mediumRatedCount: {
          $sum: { $cond: [{ $and: [{ $gte: ["$business.rating", 3.5] }, { $lt: ["$business.rating", 4.5] }] }, 1, 0] }
        },
        lowRatedCount: {
          $sum: { $cond: [{ $lt: ["$business.rating", 3.5] }, 1, 0] }
        }
      }
    },

    // Calculate additional metrics
    {
      $addFields: {
        categoryDensity: { $divide: ["$totalLocations", 314] }, // per square km (10 mile radius ≈ 314 sq km)
        highRatedPercentage: { $multiply: [{ $divide: ["$highRatedCount", "$totalLocations"] }, 100] },
        averageDistanceKm: { $multiply: ["$averageDistance", 111] } // Rough conversion to km
      }
    },

    // Sort by total locations and rating
    {
      $sort: {
        totalLocations: -1,
        averageRating: -1
      }
    },

    // Format output
    {
      $project: {
        category: "$_id",
        totalLocations: 1,
        averageRating: { $round: ["$averageRating", 2] },
        averageDistanceKm: { $round: ["$averageDistanceKm", 2] },
        categoryDensity: { $round: ["$categoryDensity", 2] },
        highRatedPercentage: { $round: ["$highRatedPercentage", 1] },
        closestLocation: 1,
        ratingDistribution: {
          high: "$highRatedCount",
          medium: "$mediumRatedCount", 
          low: "$lowRatedCount"
        }
      }
    }
  ]).toArray();

  console.log("Geospatial Analytics Results:");
  console.log(JSON.stringify(geospatialAnalytics, null, 2));

  // 4. Route optimization - Find optimal path through multiple locations
  console.log("\n=== Route Optimization ===");
  const waypointLocations = [
    [-122.4194, 37.7749], // Start: San Francisco
    [-122.4094, 37.7849], // Waypoint 1
    [-122.3994, 37.7949], // Waypoint 2
    [-122.4194, 37.7749]  // End: Back to start
  ];

  // Find locations near each waypoint
  const routeAnalysis = await Promise.all(
    waypointLocations.map(async (waypoint, index) => {
      const nearbyOnRoute = await locations.find({
        location: {
          $near: {
            $geometry: {
              type: "Point",
              coordinates: waypoint
            },
            $maxDistance: 500 // 500m radius
          }
        },
        "business.isActive": true
      }).limit(5).toArray();

      return {
        waypointIndex: index,
        coordinates: waypoint,
        nearbyLocations: nearbyOnRoute.map(loc => ({
          name: loc.name,
          category: loc.category,
          rating: loc.business.rating,
          coordinates: loc.location.coordinates
        }))
      };
    })
  );

  console.log("Route Analysis:");
  console.log(JSON.stringify(routeAnalysis, null, 2));

  return {
    nearbyLocations: nearbyLocations.length,
    locationsInArea: locationsInArea.length,
    analyticsResults: geospatialAnalytics.length,
    routeWaypoints: routeAnalysis.length
  };
};

// Real-time location tracking and geofencing
const setupLocationTracking = async () => {
  const userLocations = db.collection('user_locations');
  const geofences = db.collection('geofences');

  // Create user location tracking document
  const userLocationDocument = {
    _id: new ObjectId(),
    userId: new ObjectId("64a1b2c3d4e5f6789012347a"),

    // Current location
    currentLocation: {
      type: "Point",
      coordinates: [-122.4194, 37.7749]
    },

    // Location metadata
    locationMetadata: {
      accuracy: 10, // meters
      altitude: 15, // meters above sea level
      heading: 45, // degrees from north
      speed: 1.5, // meters per second
      timestamp: new Date(),
      source: "GPS", // GPS, WiFi, Cellular, Manual
      batteryLevel: 85,

      // Device context
      device: {
        platform: "iOS",
        version: "17.1",
        model: "iPhone 15 Pro",
        appVersion: "2.1.0"
      }
    },

    // Location history (recent positions)
    locationHistory: [
      {
        location: {
          type: "Point", 
          coordinates: [-122.4204, 37.7739]
        },
        timestamp: new Date(Date.now() - 300000), // 5 minutes ago
        accuracy: 15,
        source: "GPS"
      },
      {
        location: {
          type: "Point",
          coordinates: [-122.4214, 37.7729] 
        },
        timestamp: new Date(Date.now() - 600000), // 10 minutes ago
        accuracy: 12,
        source: "GPS"
      }
    ],

    // Privacy and permissions
    privacy: {
      shareLocation: true,
      accuracyLevel: "precise", // precise, approximate, city
      shareWithFriends: true,
      shareWithBusiness: false,
      trackingEnabled: true
    },

    // Activity context
    activity: {
      type: "walking", // walking, driving, cycling, stationary
      confidence: 0.85,
      detectedTransition: null,
      lastActivity: "stationary"
    },

    createdAt: new Date(),
    updatedAt: new Date()
  };

  // Create indexes for location tracking
  await userLocations.createIndex({ currentLocation: "2dsphere" });
  await userLocations.createIndex({ userId: 1, "locationMetadata.timestamp": -1 });
  await userLocations.createIndex({ "locationHistory.location": "2dsphere" });

  await userLocations.insertOne(userLocationDocument);

  // Create geofence system
  const geofenceDocument = {
    _id: new ObjectId(),
    name: "Downtown Coffee Shop Promo Zone",
    description: "Special promotions for coffee shops in downtown area",

    // Geofence area
    area: {
      type: "Polygon",
      coordinates: [[
        [-122.4200, 37.7700],
        [-122.4100, 37.7700], 
        [-122.4100, 37.7800],
        [-122.4200, 37.7800],
        [-122.4200, 37.7700]
      ]]
    },

    // Geofence configuration
    config: {
      type: "promotional", // promotional, security, analytics, notification
      radius: null, // For circular geofences
      isActive: true,

      // Trigger conditions
      triggers: {
        onEnter: true,
        onExit: true,
        onDwell: true,
        dwellTimeMinutes: 5,

        // Rate limiting
        minTimeBetweenTriggers: 300, // seconds
        maxTriggersPerDay: 10
      },

      // Actions to take
      actions: {
        notification: {
          enabled: true,
          title: "Coffee Deals Nearby!",
          message: "Check out special offers at local coffee shops",
          deepLink: "app://offers/coffee"
        },
        analytics: {
          trackEntry: true,
          trackExit: true,
          trackDwellTime: true
        },
        webhook: {
          enabled: false,
          url: "https://api.example.com/geofence-trigger",
          method: "POST"
        }
      }
    },

    // Analytics
    analytics: {
      totalEnters: 1456,
      totalExits: 1423,
      avgDwellTimeMinutes: 12.5,
      uniqueUsers: 342,

      // Time-based patterns
      hourlyActivity: {
        "08": 45, "09": 78, "10": 23, "11": 34,
        "12": 89, "13": 67, "14": 45, "15": 56,
        "16": 78, "17": 123, "18": 89, "19": 34
      },

      dailyActivity: {
        "monday": 234, "tuesday": 189, "wednesday": 267,
        "thursday": 201, "friday": 298, "saturday": 156, "sunday": 111
      }
    },

    createdAt: new Date("2024-09-01"),
    updatedAt: new Date("2024-09-14")
  };

  await geofences.createIndex({ area: "2dsphere" });
  await geofences.createIndex({ "config.isActive": 1, "config.type": 1 });

  await geofences.insertOne(geofenceDocument);

  // Real-time geofence checking function
  const checkGeofences = async (userId, currentLocation) => {
    console.log("Checking geofences for user location...");

    // Find all active geofences that contain the user's location
    const triggeredGeofences = await geofences.find({
      "config.isActive": true,
      area: {
        $geoIntersects: {
          $geometry: {
            type: "Point",
            coordinates: currentLocation
          }
        }
      }
    }).toArray();

    console.log(`Found ${triggeredGeofences.length} triggered geofences`);

    // Process each triggered geofence
    for (const geofence of triggeredGeofences) {
      console.log(`Processing geofence: ${geofence.name}`);

      // Update analytics
      await geofences.updateOne(
        { _id: geofence._id },
        {
          $inc: { 
            "analytics.totalEnters": 1,
            [`analytics.hourlyActivity.${new Date().getHours().toString().padStart(2, '0')}`]: 1,
            [`analytics.dailyActivity.${new Date().toLocaleDateString('en-US', { weekday: 'long' }).toLowerCase()}`]: 1
          },
          $set: { updatedAt: new Date() }
        }
      );

      // Trigger actions (notifications, webhooks, etc.)
      if (geofence.config.actions.notification.enabled) {
        console.log(`Sending notification: ${geofence.config.actions.notification.title}`);
        // Implementation would send actual notification
      }
    }

    return triggeredGeofences;
  };

  // Test geofence checking
  const testLocation = [-122.4150, 37.7750]; // Point within the geofence
  const triggeredFences = await checkGeofences(userLocationDocument.userId, testLocation);

  return {
    userLocationDocument,
    geofenceDocument,
    triggeredGeofences: triggeredFences.length
  };
};

// Advanced spatial analytics and heatmap generation
const generateSpatialAnalytics = async () => {
  const locations = db.collection('locations');
  const userLocations = db.collection('user_locations');

  console.log("=== Generating Spatial Analytics ===");

  // 1. Location Density Analysis
  const locationDensityAnalysis = await locations.aggregate([
    {
      $match: {
        "business.isActive": true
      }
    },

    // Create grid cells for density analysis
    {
      $addFields: {
        gridCell: {
          lat: {
            $floor: {
              $multiply: [
                { $arrayElemAt: ["$location.coordinates", 1] }, // latitude
                1000 // Create 0.001 degree grid cells (~100m)
              ]
            }
          },
          lng: {
            $floor: {
              $multiply: [
                { $arrayElemAt: ["$location.coordinates", 0] }, // longitude  
                1000
              ]
            }
          }
        }
      }
    },

    // Group by grid cell
    {
      $group: {
        _id: "$gridCell",
        locationCount: { $sum: 1 },
        avgRating: { $avg: "$business.rating" },
        categories: { $push: "$category" },

        // Calculate center point of grid cell
        centerCoordinates: {
          $first: {
            type: "Point",
            coordinates: [
              { $divide: ["$gridCell.lng", 1000] },
              { $divide: ["$gridCell.lat", 1000] }
            ]
          }
        },

        // Business metrics
        totalReviews: { $sum: "$business.totalReviews" },
        uniqueCategories: { $addToSet: "$category" }
      }
    },

    // Calculate density metrics
    {
      $addFields: {
        densityScore: {
          $multiply: [
            "$locationCount",
            { $divide: ["$avgRating", 5] } // Weight by average rating
          ]
        },
        categoryDiversity: { $size: "$uniqueCategories" }
      }
    },

    // Sort by density
    {
      $sort: { densityScore: -1 }
    },

    {
      $limit: 20 // Top 20 densest areas
    },

    {
      $project: {
        gridId: "$_id",
        locationCount: 1,
        densityScore: { $round: ["$densityScore", 2] },
        avgRating: { $round: ["$avgRating", 2] },
        categoryDiversity: 1,
        totalReviews: 1,
        centerCoordinates: 1
      }
    }
  ]).toArray();

  console.log(`Location Density Analysis - Found ${locationDensityAnalysis.length} high-density areas`);

  // 2. User Movement Patterns
  const userMovementAnalysis = await userLocations.aggregate([
    {
      $match: {
        "locationMetadata.timestamp": {
          $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) // Last 7 days
        }
      }
    },

    // Unwind location history
    { $unwind: "$locationHistory" },

    // Calculate movement vectors
    {
      $addFields: {
        movement: {
          fromLat: { $arrayElemAt: ["$locationHistory.location.coordinates", 1] },
          fromLng: { $arrayElemAt: ["$locationHistory.location.coordinates", 0] },
          toLat: { $arrayElemAt: ["$currentLocation.coordinates", 1] },
          toLng: { $arrayElemAt: ["$currentLocation.coordinates", 0] },
          timestamp: "$locationHistory.timestamp"
        }
      }
    },

    // Calculate distance and bearing
    {
      $addFields: {
        "movement.distance": {
          // Haversine formula approximation
          $multiply: [
            6371000, // Earth radius in meters
            {
              $acos: {
                $add: [
                  {
                    $multiply: [
                      { $sin: { $multiply: [{ $degreesToRadians: "$movement.fromLat" }, 1] } },
                      { $sin: { $multiply: [{ $degreesToRadians: "$movement.toLat" }, 1] } }
                    ]
                  },
                  {
                    $multiply: [
                      { $cos: { $multiply: [{ $degreesToRadians: "$movement.fromLat" }, 1] } },
                      { $cos: { $multiply: [{ $degreesToRadians: "$movement.toLat" }, 1] } },
                      { $cos: {
                        $multiply: [
                          { $degreesToRadians: { $subtract: ["$movement.toLng", "$movement.fromLng"] } },
                          1
                        ]
                      } }
                    ]
                  }
                ]
              }
            }
          ]
        }
      }
    },

    // Group movement patterns
    {
      $group: {
        _id: {
          hour: { $hour: "$movement.timestamp" },
          dayOfWeek: { $dayOfWeek: "$movement.timestamp" }
        },

        totalMovements: { $sum: 1 },
        avgDistance: { $avg: "$movement.distance" },
        totalDistance: { $sum: "$movement.distance" },
        uniqueUsers: { $addToSet: "$userId" },

        // Movement characteristics
        shortMovements: {
          $sum: { $cond: [{ $lt: ["$movement.distance", 100] }, 1, 0] } // < 100m
        },
        mediumMovements: {
          $sum: { $cond: [
            { $and: [
              { $gte: ["$movement.distance", 100] },
              { $lt: ["$movement.distance", 1000] }
            ]}, 1, 0
          ] } // 100m - 1km
        },
        longMovements: {
          $sum: { $cond: [{ $gte: ["$movement.distance", 1000] }, 1, 0] } // > 1km
        }
      }
    },

    // Calculate additional metrics
    {
      $addFields: {
        uniqueUserCount: { $size: "$uniqueUsers" },
        avgMovementsPerUser: { $divide: ["$totalMovements", { $size: "$uniqueUsers" }] },
        movementDistribution: {
          short: { $divide: ["$shortMovements", "$totalMovements"] },
          medium: { $divide: ["$mediumMovements", "$totalMovements"] },
          long: { $divide: ["$longMovements", "$totalMovements"] }
        }
      }
    },

    {
      $sort: { totalMovements: -1 }
    },

    {
      $project: {
        hour: "$_id.hour",
        dayOfWeek: "$_id.dayOfWeek", 
        totalMovements: 1,
        uniqueUserCount: 1,
        avgDistance: { $round: ["$avgDistance", 1] },
        avgMovementsPerUser: { $round: ["$avgMovementsPerUser", 1] },
        movementDistribution: {
          short: { $round: ["$movementDistribution.short", 3] },
          medium: { $round: ["$movementDistribution.medium", 3] },
          long: { $round: ["$movementDistribution.long", 3] }
        }
      }
    }
  ]).toArray();

  console.log(`User Movement Analysis - Analyzed ${userMovementAnalysis.length} time periods`);

  // 3. Geographic Performance Analysis
  const geoPerformanceAnalysis = await locations.aggregate([
    {
      $match: {
        "business.isActive": true,
        "analytics.monthlyVisitors": { $exists: true }
      }
    },

    // Create geographic regions
    {
      $addFields: {
        region: {
          $switch: {
            branches: [
              {
                case: {
                  $and: [
                    { $gte: [{ $arrayElemAt: ["$location.coordinates", 1] }, 37.77] }, // North of 37.77°N
                    { $lte: [{ $arrayElemAt: ["$location.coordinates", 0] }, -122.41] } // West of -122.41°W
                  ]
                },
                then: "Northwest"
              },
              {
                case: {
                  $and: [
                    { $gte: [{ $arrayElemAt: ["$location.coordinates", 1] }, 37.77] },
                    { $gt: [{ $arrayElemAt: ["$location.coordinates", 0] }, -122.41] }
                  ]
                },
                then: "Northeast"
              },
              {
                case: {
                  $and: [
                    { $lt: [{ $arrayElemAt: ["$location.coordinates", 1] }, 37.77] },
                    { $lte: [{ $arrayElemAt: ["$location.coordinates", 0] }, -122.41] }
                  ]
                },
                then: "Southwest"
              },
              {
                case: {
                  $and: [
                    { $lt: [{ $arrayElemAt: ["$location.coordinates", 1] }, 37.77] },
                    { $gt: [{ $arrayElemAt: ["$location.coordinates", 0] }, -122.41] }
                  ]
                },
                then: "Southeast"
              }
            ],
            default: "Other"
          }
        }
      }
    },

    // Group by region and category
    {
      $group: {
        _id: {
          region: "$region",
          category: "$category"
        },

        locationCount: { $sum: 1 },
        avgRating: { $avg: "$business.rating" },
        avgMonthlyVisitors: { $avg: "$analytics.monthlyVisitors" },
        totalMonthlyVisitors: { $sum: "$analytics.monthlyVisitors" },

        // Performance metrics
        highPerformers: {
          $sum: {
            $cond: [
              {
                $and: [
                  { $gte: ["$business.rating", 4.5] },
                  { $gte: ["$analytics.monthlyVisitors", 10000] }
                ]
              }, 1, 0
            ]
          }
        },

        topLocation: {
          $max: {
            name: "$name",
            visitors: "$analytics.monthlyVisitors",
            rating: "$business.rating"
          }
        }
      }
    },

    // Calculate regional metrics
    {
      $group: {
        _id: "$_id.region",

        categories: {
          $push: {
            category: "$_id.category",
            locationCount: "$locationCount",
            avgRating: "$avgRating",
            avgMonthlyVisitors: "$avgMonthlyVisitors",
            totalMonthlyVisitors: "$totalMonthlyVisitors",
            highPerformers: "$highPerformers",
            topLocation: "$topLocation"
          }
        },

        regionalTotals: {
          totalLocations: { $sum: "$locationCount" },
          totalMonthlyVisitors: { $sum: "$totalMonthlyVisitors" },
          totalHighPerformers: { $sum: "$highPerformers" }
        }
      }
    },

    // Sort by total visitors
    {
      $sort: { "regionalTotals.totalMonthlyVisitors": -1 }
    },

    {
      $project: {
        region: "$_id",
        categories: 1,
        regionalTotals: 1,

        // Calculate regional performance metrics
        performanceMetrics: {
          avgVisitorsPerLocation: {
            $divide: ["$regionalTotals.totalMonthlyVisitors", "$regionalTotals.totalLocations"]
          },
          highPerformerRatio: {
            $divide: ["$regionalTotals.totalHighPerformers", "$regionalTotals.totalLocations"]
          }
        }
      }
    }
  ]).toArray();

  console.log(`Geographic Performance Analysis - Analyzed ${geoPerformanceAnalysis.length} regions`);

  return {
    densityAnalysis: locationDensityAnalysis,
    movementAnalysis: userMovementAnalysis,
    performanceAnalysis: geoPerformanceAnalysis,

    summary: {
      densityHotspots: locationDensityAnalysis.length,
      movementPatterns: userMovementAnalysis.length,
      regionalInsights: geoPerformanceAnalysis.length
    }
  };
};

// Benefits of MongoDB Geospatial Features:
// - Native GeoJSON support with automatic validation
// - Multiple coordinate reference systems (2D, 2dsphere)
// - Built-in spatial operators and aggregation functions
// - Automatic spatial indexing with B-tree and R-tree structures
// - Spherical geometry calculations for Earth-based applications
// - Integration with aggregation framework for complex analytics
// - Real-time geofencing and location tracking capabilities
// - Scalable to billions of location data points
// - Simple query syntax compared to PostGIS extensions
// - No additional setup required - works out of the box

module.exports = {
  createLocationServiceDataModel,
  performGeospatialOperations,
  setupLocationTracking,
  generateSpatialAnalytics
};

Understanding MongoDB Geospatial Architecture

Coordinate Systems and Indexing Strategies

MongoDB supports multiple geospatial indexing approaches optimized for different use cases:

// Advanced geospatial indexing and coordinate system management
class GeospatialIndexManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
  }

  async setupGeospatialIndexing() {
    // 1. 2dsphere Index - For spherical geometry (Earth-based coordinates)
    const locations = this.db.collection('locations');

    // Create 2dsphere index for GeoJSON objects
    await locations.createIndex({ location: "2dsphere" });

    // Compound index for filtered geospatial queries
    await locations.createIndex({ 
      category: 1, 
      "business.isActive": 1, 
      location: "2dsphere" 
    });

    // Text and geospatial compound index
    await locations.createIndex({
      "$**": "text",
      location: "2dsphere"
    });

    console.log("2dsphere indexes created for global location queries");

    // 2. 2d Index - For flat geometry (game maps, floor plans)
    const gameLocations = this.db.collection('game_locations');

    // 2d index for flat coordinate system (e.g., game world coordinates)
    await gameLocations.createIndex({ position: "2d" });

    // Example game location document
    const gameLocationDoc = {
      _id: new ObjectId(),
      playerId: new ObjectId(),
      characterName: "DragonSlayer42",

      // Flat 2D coordinates for game world
      position: [1250.5, 875.2], // [x, y] coordinates in game units

      // Game-specific data
      level: 45,
      zone: "Enchanted Forest",
      server: "US-East-1",

      // Bounding box for area of influence
      areaOfInfluence: {
        bottomLeft: [1200, 825],
        topRight: [1300, 925]
      },

      lastUpdated: new Date()
    };

    await gameLocations.insertOne(gameLocationDoc);
    console.log("2d index created for flat coordinate system");

    // 3. Specialized indexing for different data patterns
    const trajectories = this.db.collection('vehicle_trajectories');

    // Index for trajectory lines and paths
    await trajectories.createIndex({ route: "2dsphere" });
    await trajectories.createIndex({ vehicleId: 1, timestamp: 1 });

    // Example trajectory document
    const trajectoryDoc = {
      _id: new ObjectId(),
      vehicleId: "TRUCK_001",
      driverId: new ObjectId(),

      // LineString geometry for route
      route: {
        type: "LineString",
        coordinates: [
          [-122.4194, 37.7749], // Start point
          [-122.4184, 37.7759], // Waypoint 1
          [-122.4174, 37.7769], // Waypoint 2
          [-122.4164, 37.7779]  // End point
        ]
      },

      // Route metadata
      routeMetadata: {
        totalDistance: 2.3, // km
        estimatedTime: 8, // minutes
        actualTime: 9.5, // minutes
        fuelUsed: 0.45, // liters
        trafficConditions: "moderate"
      },

      // Time-based tracking
      startTime: new Date("2024-09-18T14:30:00Z"),
      endTime: new Date("2024-09-18T14:39:30Z"),

      // Performance metrics
      metrics: {
        averageSpeed: 14.5, // km/h
        maxSpeed: 25.0,
        idleTime: 45, // seconds
        hardBrakingEvents: 1,
        hardAccelerationEvents: 0
      }
    };

    await trajectories.insertOne(trajectoryDoc);
    console.log("Trajectory tracking setup completed");

    return {
      sphericalIndexes: ["locations.location", "locations.compound"],
      flatIndexes: ["game_locations.position"],
      trajectoryIndexes: ["trajectories.route"]
    };
  }

  async performAdvancedSpatialQueries() {
    const locations = this.db.collection('locations');

    // 1. Multi-stage geospatial aggregation
    console.log("=== Advanced Spatial Aggregation ===");

    const complexSpatialAnalysis = await locations.aggregate([
      // Stage 1: Geospatial filtering
      {
        $geoNear: {
          near: {
            type: "Point",
            coordinates: [-122.4194, 37.7749]
          },
          distanceField: "calculatedDistance",
          maxDistance: 10000, // 10km
          spherical: true,
          query: { "business.isActive": true }
        }
      },

      // Stage 2: Spatial relationship analysis
      {
        $addFields: {
          // Distance categories
          distanceCategory: {
            $switch: {
              branches: [
                { case: { $lte: ["$calculatedDistance", 1000] }, then: "nearby" },
                { case: { $lte: ["$calculatedDistance", 5000] }, then: "moderate" },
                { case: { $lte: ["$calculatedDistance", 10000] }, then: "distant" }
              ],
              default: "very_distant"
            }
          },

          // Spatial density calculation
          spatialDensity: {
            $divide: ["$analytics.monthlyVisitors", { $add: ["$calculatedDistance", 1] }]
          }
        }
      },

      // Stage 3: Complex geospatial grouping
      {
        $group: {
          _id: {
            category: "$category",
            distanceCategory: "$distanceCategory"
          },

          locations: { $push: "$$ROOT" },
          avgDistance: { $avg: "$calculatedDistance" },
          avgRating: { $avg: "$business.rating" },
          avgDensity: { $avg: "$spatialDensity" },
          count: { $sum: 1 },

          // Geospatial aggregations
          centroid: {
            $avg: {
              coordinates: "$location.coordinates"
            }
          },

          // Bounding box calculation
          minLat: { $min: { $arrayElemAt: ["$location.coordinates", 1] } },
          maxLat: { $max: { $arrayElemAt: ["$location.coordinates", 1] } },
          minLng: { $min: { $arrayElemAt: ["$location.coordinates", 0] } },
          maxLng: { $max: { $arrayElemAt: ["$location.coordinates", 0] } }
        }
      },

      // Stage 4: Spatial statistics
      {
        $addFields: {
          boundingBox: {
            type: "Polygon",
            coordinates: [[
              ["$minLng", "$minLat"],
              ["$maxLng", "$minLat"], 
              ["$maxLng", "$maxLat"],
              ["$minLng", "$maxLat"],
              ["$minLng", "$minLat"]
            ]]
          },

          // Geographic spread calculation
          geographicSpread: {
            $sqrt: {
              $add: [
                { $pow: [{ $subtract: ["$maxLat", "$minLat"] }, 2] },
                { $pow: [{ $subtract: ["$maxLng", "$minLng"] }, 2] }
              ]
            }
          }
        }
      },

      {
        $sort: { count: -1, avgDensity: -1 }
      }
    ]).toArray();

    console.log(`Complex Spatial Analysis - ${complexSpatialAnalysis.length} category/distance combinations`);

    // 2. Intersection and overlay queries
    console.log("\n=== Spatial Intersection Analysis ===");

    const intersectionAnalysis = await locations.aggregate([
      {
        $match: {
          "business.isActive": true,
          deliveryZones: { $exists: true }
        }
      },

      // Find intersections between delivery zones
      {
        $lookup: {
          from: "locations",
          let: { currentZones: "$deliveryZones" },
          pipeline: [
            {
              $match: {
                $expr: {
                  $and: [
                    { $ne: ["$_id", "$$ROOT._id"] }, // Different location
                    { $ne: ["$$currentZones", null] },
                    {
                      $gt: [{
                        $size: {
                          $filter: {
                            input: "$deliveryZones.coordinates",
                            cond: {
                              // Simplified intersection check
                              $anyElementTrue: {
                                $map: {
                                  input: "$$currentZones.coordinates",
                                  in: { $ne: ["$$this", null] }
                                }
                              }
                            }
                          }
                        }
                      }, 0]
                    }
                  ]
                }
              }
            },
            {
              $project: {
                name: 1,
                category: 1,
                "business.rating": 1
              }
            }
          ],
          as: "overlappingLocations"
        }
      },

      // Calculate overlap metrics
      {
        $addFields: {
          overlapCount: { $size: "$overlappingLocations" },
          hasOverlap: { $gt: [{ $size: "$overlappingLocations" }, 0] },
          competitionLevel: {
            $switch: {
              branches: [
                { case: { $gte: [{ $size: "$overlappingLocations" }, 5] }, then: "high" },
                { case: { $gte: [{ $size: "$overlappingLocations" }, 2] }, then: "medium" },
                { case: { $gt: [{ $size: "$overlappingLocations" }, 0] }, then: "low" }
              ],
              default: "none"
            }
          }
        }
      },

      {
        $match: { hasOverlap: true }
      },

      {
        $group: {
          _id: "$category",
          avgOverlapCount: { $avg: "$overlapCount" },
          locationsWithOverlap: { $sum: 1 },
          highCompetitionAreas: {
            $sum: { $cond: [{ $eq: ["$competitionLevel", "high"] }, 1, 0] }
          }
        }
      },

      { $sort: { avgOverlapCount: -1 } }
    ]).toArray();

    console.log(`Intersection Analysis - ${intersectionAnalysis.length} categories with delivery zone overlaps`);

    // 3. Temporal-spatial analysis
    console.log("\n=== Temporal-Spatial Analysis ===");

    const temporalSpatialAnalysis = await this.db.collection('user_locations').aggregate([
      {
        $match: {
          "locationMetadata.timestamp": {
            $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) // Last 24 hours
          }
        }
      },

      // Unwind location history for temporal analysis
      { $unwind: "$locationHistory" },

      // Create time buckets
      {
        $addFields: {
          timeBucket: {
            $dateTrunc: {
              date: "$locationHistory.timestamp",
              unit: "hour"
            }
          },

          // Grid cell for spatial grouping
          spatialGrid: {
            lat: {
              $floor: {
                $multiply: [
                  { $arrayElemAt: ["$locationHistory.location.coordinates", 1] },
                  1000 // 0.001 degree precision
                ]
              }
            },
            lng: {
              $floor: {
                $multiply: [
                  { $arrayElemAt: ["$locationHistory.location.coordinates", 0] },
                  1000
                ]
              }
            }
          }
        }
      },

      // Group by time and space
      {
        $group: {
          _id: {
            timeBucket: "$timeBucket",
            spatialGrid: "$spatialGrid"
          },

          uniqueUsers: { $addToSet: "$userId" },
          totalEvents: { $sum: 1 },
          avgAccuracy: { $avg: "$locationHistory.accuracy" },

          // Location cluster center
          centerLat: { $avg: { $arrayElemAt: ["$locationHistory.location.coordinates", 1] } },
          centerLng: { $avg: { $arrayElemAt: ["$locationHistory.location.coordinates", 0] } }
        }
      },

      // Calculate density metrics
      {
        $addFields: {
          userDensity: { $size: "$uniqueUsers" },
          eventDensity: "$totalEvents",
          densityScore: { $multiply: [{ $size: "$uniqueUsers" }, { $log: { $add: ["$totalEvents", 1] } }] }
        }
      },

      // Temporal pattern analysis
      {
        $group: {
          _id: { $hour: "$_id.timeBucket" },

          totalGridCells: { $sum: 1 },
          avgUserDensity: { $avg: "$userDensity" },
          maxUserDensity: { $max: "$userDensity" },
          totalUniqueUsers: { $sum: "$userDensity" },

          // Hotspot identification
          hotspots: {
            $push: {
              $cond: [
                { $gte: ["$densityScore", 10] },
                {
                  center: { type: "Point", coordinates: ["$centerLng", "$centerLat"] },
                  userDensity: "$userDensity",
                  densityScore: "$densityScore"
                },
                null
              ]
            }
          }
        }
      },

      // Clean up hotspots array
      {
        $addFields: {
          hotspots: {
            $filter: {
              input: "$hotspots",
              cond: { $ne: ["$$this", null] }
            }
          }
        }
      },

      { $sort: { "_id": 1 } },

      {
        $project: {
          hour: "$_id",
          totalGridCells: 1,
          avgUserDensity: { $round: ["$avgUserDensity", 2] },
          maxUserDensity: 1,
          totalUniqueUsers: 1,
          hotspotCount: { $size: "$hotspots" },
          topHotspots: { $slice: ["$hotspots", 5] }
        }
      }
    ]).toArray();

    console.log(`Temporal-Spatial Analysis - ${temporalSpatialAnalysis.length} hourly patterns`);

    return {
      complexSpatialResults: complexSpatialAnalysis.length,
      intersectionResults: intersectionAnalysis.length,  
      temporalSpatialResults: temporalSpatialAnalysis.length,

      insights: {
        spatialComplexity: complexSpatialAnalysis,
        deliveryOverlaps: intersectionAnalysis,
        hourlyPatterns: temporalSpatialAnalysis
      }
    };
  }

  async optimizeGeospatialPerformance() {
    console.log("=== Geospatial Performance Optimization ===");

    // 1. Index performance analysis
    const locations = this.db.collection('locations');

    // Test different query patterns
    const performanceTests = [
      {
        name: "Simple Proximity Query",
        query: {
          location: {
            $near: {
              $geometry: { type: "Point", coordinates: [-122.4194, 37.7749] },
              $maxDistance: 5000
            }
          }
        }
      },
      {
        name: "Filtered Proximity Query", 
        query: {
          location: {
            $near: {
              $geometry: { type: "Point", coordinates: [-122.4194, 37.7749] },
              $maxDistance: 5000
            }
          },
          category: "restaurant",
          "business.isActive": true
        }
      },
      {
        name: "Geo Within Query",
        query: {
          location: {
            $geoWithin: {
              $centerSphere: [[-122.4194, 37.7749], 5 / 3963.2] // 5 miles
            }
          }
        }
      }
    ];

    const performanceResults = [];

    for (const test of performanceTests) {
      const startTime = Date.now();

      const results = await locations.find(test.query)
        .limit(20)
        .explain("executionStats");

      const executionTime = Date.now() - startTime;

      performanceResults.push({
        testName: test.name,
        executionTimeMs: executionTime,
        documentsExamined: results.executionStats.totalDocsExamined,
        documentsReturned: results.executionStats.totalDocsReturned,
        indexUsed: results.executionStats.executionStages?.indexName || "none",
        efficiency: results.executionStats.totalDocsReturned / Math.max(results.executionStats.totalDocsExamined, 1)
      });
    }

    console.log("Performance Test Results:");
    performanceResults.forEach(result => {
      console.log(`${result.testName}: ${result.executionTimeMs}ms, Efficiency: ${(result.efficiency * 100).toFixed(1)}%`);
    });

    // 2. Index recommendations
    const indexRecommendations = await this.analyzeIndexUsage(locations);

    // 3. Memory usage optimization
    const memoryOptimization = await this.optimizeMemoryUsage(locations);

    return {
      performanceResults,
      indexRecommendations,
      memoryOptimization,

      recommendations: [
        "Use 2dsphere indexes for Earth-based coordinates",
        "Include commonly filtered fields in compound indexes",
        "Limit result sets with appropriate $maxDistance values", 
        "Use $geoNear aggregation for complex distance-based analytics",
        "Monitor index usage and query patterns regularly"
      ]
    };
  }

  async analyzeIndexUsage(collection) {
    // Get index usage statistics
    const indexStats = await collection.aggregate([
      { $indexStats: {} }
    ]).toArray();

    const recommendations = [];

    indexStats.forEach(stat => {
      const usageRatio = stat.accesses.ops / (stat.accesses.since?.getTime() || 1);

      if (usageRatio < 0.001) {
        recommendations.push({
          type: "remove",
          index: stat.name,
          reason: "Low usage index - consider removing",
          usage: usageRatio
        });
      } else if (usageRatio > 10) {
        recommendations.push({
          type: "optimize",
          index: stat.name, 
          reason: "High usage index - ensure optimal configuration",
          usage: usageRatio
        });
      }
    });

    return {
      totalIndexes: indexStats.length,
      recommendations: recommendations,
      indexStats: indexStats
    };
  }

  async optimizeMemoryUsage(collection) {
    // Analyze document sizes and memory patterns
    const sizeAnalysis = await collection.aggregate([
      {
        $project: {
          documentSize: { $bsonSize: "$$ROOT" },
          hasLocationHistory: { $ne: ["$locationHistory", null] },
          locationHistorySize: { $size: { $ifNull: ["$locationHistory", []] } },
          hasDeliveryZones: { $ne: ["$deliveryZones", null] }
        }
      },
      {
        $group: {
          _id: null,

          avgDocumentSize: { $avg: "$documentSize" },
          maxDocumentSize: { $max: "$documentSize" },
          minDocumentSize: { $min: "$documentSize" },

          largeDocuments: { $sum: { $cond: [{ $gt: ["$documentSize", 16384] }, 1, 0] } }, // > 16KB
          documentsWithHistory: { $sum: { $cond: ["$hasLocationHistory", 1, 0] } },
          avgHistorySize: { $avg: "$locationHistorySize" },

          totalDocuments: { $sum: 1 }
        }
      }
    ]).toArray();

    const analysis = sizeAnalysis[0] || {};

    const optimizationTips = [];

    if (analysis.avgDocumentSize > 8192) {
      optimizationTips.push("Consider splitting large documents or using references");
    }

    if (analysis.avgHistorySize > 100) {
      optimizationTips.push("Limit location history array size or archive old data");
    }

    if (analysis.largeDocuments > analysis.totalDocuments * 0.1) {
      optimizationTips.push("High number of large documents - review document structure");
    }

    return {
      sizeAnalysis: analysis,
      optimizationTips: optimizationTips,

      recommendations: {
        documentSize: "Keep documents under 16MB, optimal under 1MB",
        arrays: "Limit embedded arrays to prevent unbounded growth", 
        indexing: "Use partial indexes for sparse geospatial data",
        sharding: "Consider sharding key that includes geospatial distribution"
      }
    };
  }
}

SQL-Style Geospatial Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB's powerful geospatial capabilities:

-- QueryLeaf geospatial operations with SQL-familiar syntax

-- Create geospatial-enabled table/collection
CREATE TABLE locations (
  id OBJECTID PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  category VARCHAR(100),

  -- Geospatial columns with native GeoJSON support
  location POINT NOT NULL, -- GeoJSON Point
  service_area POLYGON,    -- GeoJSON Polygon
  delivery_zones MULTIPOLYGON, -- GeoJSON MultiPolygon

  -- Business data
  rating DECIMAL(3,2),
  total_reviews INTEGER DEFAULT 0,
  is_active BOOLEAN DEFAULT true,

  -- Address information
  address DOCUMENT {
    street VARCHAR(255),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(100),
    postal_code VARCHAR(20)
  },

  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create geospatial indexes
CREATE SPATIAL INDEX idx_locations_location ON locations (location);
CREATE SPATIAL INDEX idx_locations_service_area ON locations (service_area);
CREATE COMPOUND INDEX idx_locations_category_geo ON locations (category, location);

-- Insert location data with geospatial coordinates
INSERT INTO locations (name, category, location, service_area, address, rating, total_reviews)
VALUES (
  'Blue Bottle Coffee',
  'cafe', 
  ST_POINT(-122.3937, 37.7955), -- Longitude, Latitude
  ST_POLYGON(ARRAY[
    ARRAY[-122.4050, 37.7850], -- Southwest
    ARRAY[-122.3850, 37.7850], -- Southeast  
    ARRAY[-122.3850, 37.8050], -- Northeast
    ARRAY[-122.4050, 37.8050], -- Northwest
    ARRAY[-122.4050, 37.7850]  -- Close polygon
  ]),
  {
    street: '1 Ferry Building',
    city: 'San Francisco',
    state: 'CA',
    country: 'USA',
    postal_code: '94111'
  },
  4.6,
  1247
);

-- Proximity search - find nearby locations
SELECT 
  id,
  name,
  category,
  rating,

  -- Calculate distance in meters
  ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749)) as distance_meters,

  -- Extract coordinates for display
  ST_X(location) as longitude,
  ST_Y(location) as latitude,

  -- Address information
  address.street,
  address.city,
  address.state

FROM locations
WHERE 
  is_active = true
  AND ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749)) <= 5000 -- Within 5km
  AND category IN ('cafe', 'restaurant', 'retail')
ORDER BY ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749))
LIMIT 20;

-- Advanced proximity search with relevance scoring
WITH nearby_locations AS (
  SELECT 
    *,
    ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749)) as distance_meters
  FROM locations
  WHERE 
    is_active = true
    AND ST_DWITHIN(location, ST_POINT(-122.4194, 37.7749), 10000) -- 10km radius
),
scored_locations AS (
  SELECT *,
    -- Relevance scoring: distance (40%) + rating (30%) + reviews (30%)
    (
      (1000 - LEAST(distance_meters, 1000)) / 1000 * 0.4 +
      (rating / 5.0) * 0.3 +
      (LEAST(total_reviews, 1000) / 1000.0) * 0.3
    ) as relevance_score,

    -- Distance categories
    CASE 
      WHEN distance_meters <= 1000 THEN 'nearby'
      WHEN distance_meters <= 5000 THEN 'moderate'
      ELSE 'distant'
    END as distance_category

  FROM nearby_locations
)
SELECT 
  name,
  category, 
  rating,
  total_reviews,
  ROUND(distance_meters) as distance_m,
  distance_category,
  ROUND(relevance_score, 3) as relevance,

  -- Format coordinates for maps
  CONCAT(
    ROUND(ST_Y(location), 6), ',', 
    ROUND(ST_X(location), 6)
  ) as lat_lng

FROM scored_locations
ORDER BY relevance_score DESC, distance_meters ASC
LIMIT 25;

-- Geospatial area queries
SELECT 
  l.name,
  l.category,
  l.rating,

  -- Check if location is within specific area
  ST_CONTAINS(
    ST_POLYGON(ARRAY[
      ARRAY[-122.4270, 37.7609], -- Downtown SF polygon
      ARRAY[-122.3968, 37.7609],
      ARRAY[-122.3968, 37.7908], 
      ARRAY[-122.4270, 37.7908],
      ARRAY[-122.4270, 37.7609]
    ]),
    l.location
  ) as is_in_downtown,

  -- Check service area coverage
  ST_CONTAINS(l.service_area, ST_POINT(-122.4194, 37.7749)) as serves_user_location

FROM locations l
WHERE 
  l.is_active = true
  AND ST_INTERSECTS(
    l.location,
    ST_POLYGON(ARRAY[
      ARRAY[-122.4270, 37.7609],
      ARRAY[-122.3968, 37.7609], 
      ARRAY[-122.3968, 37.7908],
      ARRAY[-122.4270, 37.7908],
      ARRAY[-122.4270, 37.7609]
    ])
  );

-- Complex geospatial analytics with aggregation
WITH location_analytics AS (
  SELECT 
    category,

    -- Spatial clustering analysis
    ST_CLUSTERKMEANS(location, 5) OVER () as cluster_id,

    -- Distance from city center
    ST_DISTANCE(location, ST_POINT(-122.4194, 37.7749)) as distance_from_center,

    -- Geospatial grid for density analysis
    ST_SNAPGRID(location, 0.001, 0.001) as grid_cell,

    name,
    rating,
    total_reviews,
    location

  FROM locations
  WHERE is_active = true
),
cluster_analysis AS (
  SELECT 
    cluster_id,
    category,
    COUNT(*) as location_count,
    AVG(rating) as avg_rating,
    AVG(distance_from_center) as avg_distance_from_center,

    -- Calculate cluster centroid
    ST_CENTROID(ST_COLLECT(location)) as cluster_center,

    -- Calculate cluster bounds
    ST_ENVELOPE(ST_COLLECT(location)) as cluster_bounds,

    -- Business metrics
    SUM(total_reviews) as total_reviews,
    AVG(total_reviews) as avg_reviews_per_location

  FROM location_analytics
  GROUP BY cluster_id, category
),
grid_density AS (
  SELECT 
    grid_cell,
    COUNT(DISTINCT category) as category_diversity,
    COUNT(*) as location_density,
    AVG(rating) as avg_rating,

    -- Calculate grid cell center
    ST_CENTROID(grid_cell) as grid_center

  FROM location_analytics
  GROUP BY grid_cell
  HAVING COUNT(*) >= 3 -- Only dense grid cells
)
SELECT 
  ca.cluster_id,
  ca.category,
  ca.location_count,
  ROUND(ca.avg_rating, 2) as avg_rating,
  ROUND(ca.avg_distance_from_center) as avg_distance_m,

  -- Cluster geographic data
  ST_X(ca.cluster_center) as cluster_lng,
  ST_Y(ca.cluster_center) as cluster_lat,

  -- Calculate cluster area in square meters
  ST_AREA(ca.cluster_bounds, true) as cluster_area_sqm,

  -- Density metrics
  ROUND(ca.location_count / ST_AREA(ca.cluster_bounds, true) * 1000000, 2) as density_per_sqkm,

  -- Business performance
  ca.total_reviews,
  ROUND(ca.avg_reviews_per_location) as avg_reviews,

  -- Nearby high-density areas
  (
    SELECT COUNT(*)
    FROM grid_density gd
    WHERE ST_DISTANCE(ca.cluster_center, gd.grid_center) <= 1000
  ) as nearby_dense_areas

FROM cluster_analysis ca
WHERE ca.location_count >= 2
ORDER BY ca.location_count DESC, ca.avg_rating DESC;

-- Geofencing and real-time location queries
CREATE TABLE geofences (
  id OBJECTID PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  geofence_area POLYGON NOT NULL,
  geofence_type VARCHAR(50) DEFAULT 'notification',
  is_active BOOLEAN DEFAULT true,

  -- Trigger configuration
  config DOCUMENT {
    on_enter BOOLEAN DEFAULT true,
    on_exit BOOLEAN DEFAULT true,
    on_dwell BOOLEAN DEFAULT false,
    dwell_time_minutes INTEGER DEFAULT 5,
    max_triggers_per_day INTEGER DEFAULT 10
  },

  -- Analytics tracking
  analytics DOCUMENT {
    total_enters INTEGER DEFAULT 0,
    total_exits INTEGER DEFAULT 0,
    unique_users INTEGER DEFAULT 0,
    avg_dwell_minutes DECIMAL(8,2) DEFAULT 0
  },

  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE SPATIAL INDEX idx_geofences_area ON geofences (geofence_area);

-- Check geofence triggers for user location
SELECT 
  gf.id,
  gf.name,
  gf.geofence_type,

  -- Check if user location triggers geofence
  ST_CONTAINS(gf.geofence_area, ST_POINT(-122.4150, 37.7750)) as is_triggered,

  -- Calculate distance to geofence edge
  ST_DISTANCE(
    ST_POINT(-122.4150, 37.7750),
    ST_BOUNDARY(gf.geofence_area)
  ) as distance_to_edge_m,

  -- Geofence area and perimeter
  ST_AREA(gf.geofence_area, true) as area_sqm,
  ST_PERIMETER(gf.geofence_area, true) as perimeter_m,

  -- Configuration and analytics
  gf.config,
  gf.analytics

FROM geofences gf
WHERE 
  gf.is_active = true
  AND (
    ST_CONTAINS(gf.geofence_area, ST_POINT(-122.4150, 37.7750)) -- Inside geofence
    OR ST_DISTANCE(
      ST_POINT(-122.4150, 37.7750), 
      gf.geofence_area
    ) <= 100 -- Within 100m of geofence
  );

-- Time-based geospatial analysis
CREATE TABLE user_location_history (
  id OBJECTID PRIMARY KEY,
  user_id OBJECTID NOT NULL,
  location POINT NOT NULL,
  recorded_at TIMESTAMP NOT NULL,
  accuracy_meters DECIMAL(8,2),
  activity_type VARCHAR(50),

  -- Movement data
  speed_mps DECIMAL(8,2), -- meters per second
  heading_degrees INTEGER, -- 0-360 degrees from north

  -- Context information
  context DOCUMENT {
    battery_level INTEGER,
    connection_type VARCHAR(50),
    app_state VARCHAR(50)
  }
);

CREATE COMPOUND INDEX idx_user_location_time_geo ON user_location_history (
  user_id, recorded_at, location
);

-- Movement pattern analysis
WITH user_movements AS (
  SELECT 
    user_id,
    location,
    recorded_at,

    -- Calculate distance from previous location
    ST_DISTANCE(
      location,
      LAG(location) OVER (
        PARTITION BY user_id 
        ORDER BY recorded_at
      )
    ) as movement_distance,

    -- Time since previous location
    EXTRACT(EPOCH FROM (
      recorded_at - LAG(recorded_at) OVER (
        PARTITION BY user_id 
        ORDER BY recorded_at
      )
    )) as time_elapsed_seconds,

    -- Previous location for trajectory analysis
    LAG(location) OVER (
      PARTITION BY user_id 
      ORDER BY recorded_at
    ) as previous_location

  FROM user_location_history
  WHERE recorded_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),
movement_metrics AS (
  SELECT 
    user_id,
    COUNT(*) as location_points,
    SUM(movement_distance) as total_distance_m,
    AVG(movement_distance / NULLIF(time_elapsed_seconds, 0)) as avg_speed_mps,
    MAX(movement_distance / NULLIF(time_elapsed_seconds, 0)) as max_speed_mps,

    -- Create trajectory line
    ST_MAKELINE(ARRAY_AGG(location ORDER BY recorded_at)) as trajectory,

    -- Calculate bounding box of movement
    ST_ENVELOPE(ST_COLLECT(location)) as movement_bounds,

    -- Time-based metrics
    MIN(recorded_at) as journey_start,
    MAX(recorded_at) as journey_end,
    EXTRACT(EPOCH FROM (MAX(recorded_at) - MIN(recorded_at))) as journey_duration_seconds,

    -- Movement patterns
    COUNT(DISTINCT ST_SNAPGRID(location, 0.001, 0.001)) as unique_areas_visited

  FROM user_movements
  WHERE movement_distance IS NOT NULL
    AND time_elapsed_seconds > 0
    AND movement_distance < 10000 -- Filter out GPS errors
  GROUP BY user_id
)
SELECT 
  user_id,
  location_points,
  ROUND(total_distance_m) as total_distance_m,
  ROUND(total_distance_m / 1000.0, 2) as total_distance_km,
  ROUND(avg_speed_mps * 3.6, 1) as avg_speed_kmh, -- Convert to km/h
  ROUND(max_speed_mps * 3.6, 1) as max_speed_kmh,

  -- Journey characteristics
  journey_start,
  journey_end,
  ROUND(journey_duration_seconds / 3600.0, 1) as journey_hours,
  unique_areas_visited,

  -- Trajectory analysis
  ST_LENGTH(trajectory, true) as trajectory_length_m,
  ST_AREA(movement_bounds, true) as coverage_area_sqm,

  -- Movement efficiency (straight-line vs actual distance)
  ROUND(
    ST_DISTANCE(
      ST_STARTPOINT(trajectory),
      ST_ENDPOINT(trajectory)
    ) / NULLIF(ST_LENGTH(trajectory, true), 0) * 100, 1
  ) as movement_efficiency_pct,

  -- Geographic extent
  ST_XMIN(movement_bounds) as min_longitude,
  ST_XMAX(movement_bounds) as max_longitude, 
  ST_YMIN(movement_bounds) as min_latitude,
  ST_YMAX(movement_bounds) as max_latitude

FROM movement_metrics
WHERE total_distance_m > 100 -- Minimum movement threshold
ORDER BY total_distance_m DESC
LIMIT 50;

-- Location-based recommendations engine
WITH user_preferences AS (
  SELECT 
    u.user_id,
    u.location as current_location,

    -- User preference analysis based on visit history
    up.preferred_categories,
    up.avg_rating_threshold,
    up.max_distance_preference,
    up.price_range_preference

  FROM user_profiles u
  JOIN user_preferences up ON u.user_id = up.user_id
  WHERE u.is_active = true
),
location_scoring AS (
  SELECT 
    l.*,
    up.user_id,

    -- Distance scoring
    ST_DISTANCE(l.location, up.current_location) as distance_m,
    EXP(-ST_DISTANCE(l.location, up.current_location) / 2000.0) as distance_score,

    -- Category preference scoring
    CASE 
      WHEN l.category = ANY(up.preferred_categories) THEN 1.0
      WHEN ARRAY_LENGTH(up.preferred_categories, 1) = 0 THEN 0.5
      ELSE 0.2
    END as category_score,

    -- Rating scoring
    l.rating / 5.0 as rating_score,

    -- Popularity scoring based on reviews
    LN(l.total_reviews + 1) / LN(1000) as popularity_score,

    -- Time-based scoring (open/closed)
    CASE 
      WHEN EXTRACT(DOW FROM CURRENT_TIMESTAMP) = 0 THEN -- Sunday
        CASE WHEN l.hours.sunday.is_open THEN 1.0 ELSE 0.3 END
      WHEN EXTRACT(DOW FROM CURRENT_TIMESTAMP) = 1 THEN -- Monday
        CASE WHEN l.hours.monday.is_open THEN 1.0 ELSE 0.3 END
      -- ... other days
      ELSE 0.8
    END as availability_score

  FROM locations l
  CROSS JOIN user_preferences up
  WHERE 
    l.is_active = true
    AND ST_DISTANCE(l.location, up.current_location) <= up.max_distance_preference
    AND l.rating >= up.avg_rating_threshold
),
final_recommendations AS (
  SELECT *,
    -- Combined relevance score
    (
      distance_score * 0.25 +
      category_score * 0.30 +
      rating_score * 0.20 +
      popularity_score * 0.15 +
      availability_score * 0.10
    ) as relevance_score

  FROM location_scoring
)
SELECT 
  user_id,
  name as location_name,
  category,
  rating,
  total_reviews,
  ROUND(distance_m) as distance_meters,
  ROUND(relevance_score, 3) as relevance,

  -- Location details for display
  ST_X(location) as longitude,
  ST_Y(location) as latitude,
  address.street || ', ' || address.city as display_address,

  -- Recommendation reasoning
  CASE 
    WHEN category_score = 1.0 THEN 'Matches your preferences'
    WHEN distance_score > 0.8 THEN 'Very close to you'
    WHEN rating_score >= 0.9 THEN 'Highly rated'
    WHEN popularity_score > 0.5 THEN 'Popular destination'
    ELSE 'Good option nearby'
  END as recommendation_reason

FROM final_recommendations
WHERE relevance_score > 0.3
ORDER BY user_id, relevance_score DESC
LIMIT 10 PER user_id;

-- QueryLeaf geospatial features provide:
-- 1. Native GeoJSON support with SQL-familiar geometry functions
-- 2. Spatial indexing with automatic optimization for Earth-based coordinates
-- 3. Distance calculations and proximity queries with intuitive syntax
-- 4. Complex geospatial aggregations and analytics using familiar SQL patterns
-- 5. Geofencing capabilities with real-time trigger detection
-- 6. Movement pattern analysis and trajectory tracking
-- 7. Location-based recommendation engines with multi-factor scoring
-- 8. Integration with MongoDB's native geospatial operators and functions
-- 9. Performance optimization through intelligent query planning
-- 10. Seamless scaling from simple proximity queries to complex spatial analytics

Best Practices for Geospatial Implementation

Coordinate System Selection

Choose the appropriate coordinate system and indexing strategy:

2dsphere Index: Use for Earth-based coordinates with spherical geometry calculations
2d Index: Use for flat coordinate systems like game maps or floor plans
Coordinate Format: MongoDB uses [longitude, latitude] format (opposite of many mapping APIs)
Precision Considerations: Balance coordinate precision with storage and performance requirements
Projection Selection: Choose appropriate coordinate reference system for your geographic region
Distance Units: Ensure consistent distance units throughout your application

Performance Optimization

Optimize geospatial queries for high performance and scalability:

Index Strategy: Create compound indexes that support your most common query patterns
Query Limits: Use $maxDistance and $minDistance to limit search scope
Result Pagination: Implement proper pagination for large result sets
Memory Management: Monitor working set size and optimize document structure
Aggregation Optimization: Use $geoNear for distance-based aggregations when possible
Sharding Strategy: Consider geospatial distribution when designing sharding keys

Conclusion

MongoDB geospatial capabilities provide comprehensive location-aware functionality that eliminates the complexity of traditional spatial database extensions while delivering superior performance and scalability. The native support for GeoJSON, multiple coordinate systems, and sophisticated spatial operations makes building location-based applications both powerful and intuitive.

Key geospatial benefits include:

Native Spatial Support: Built-in GeoJSON support without additional extensions or setup
High Performance: Optimized spatial indexing and query execution for billions of documents
Rich Query Capabilities: Comprehensive spatial operators for proximity, intersection, and containment
Flexible Data Models: Store complex location data with business context in single documents
Real-time Processing: Efficient geofencing and location tracking for live applications
Scalable Architecture: Horizontal scaling across distributed clusters with location-aware sharding

Whether you're building ride-sharing platforms, delivery applications, location-based social networks, or IoT sensor networks, MongoDB's geospatial features with QueryLeaf's familiar SQL interface provides the foundation for sophisticated location-aware applications. This combination enables you to implement complex spatial functionality while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB geospatial operations while providing SQL-familiar spatial query syntax, coordinate system handling, and geographic analysis functions. Advanced geospatial indexing, proximity calculations, and spatial analytics are seamlessly handled through familiar SQL patterns, making location-based application development both powerful and accessible.

The integration of native geospatial capabilities with SQL-style spatial operations makes MongoDB an ideal platform for applications requiring both sophisticated location functionality and familiar database interaction patterns, ensuring your geospatial solutions remain both effective and maintainable as they scale and evolve.

September 17, 2025
20 min read

MongoDB Time Series Collections and IoT Data Management: SQL-Style Time Series Analytics with High-Performance Data Ingestion

Modern IoT applications generate massive volumes of time-stamped data from sensors, devices, and monitoring systems requiring specialized storage, querying, and analysis capabilities. Traditional relational databases struggle with time series workloads due to their rigid schema requirements, poor compression for temporal data, and inefficient querying patterns for time-based aggregations and analytics.

MongoDB Time Series Collections provide purpose-built capabilities for storing, querying, and analyzing time-stamped data with automatic partitioning, compression, and optimized indexing. Unlike traditional collection storage, time series collections automatically organize data by time ranges, apply sophisticated compression algorithms, and provide specialized query patterns optimized for temporal analytics and IoT workloads.

The Traditional Time Series Challenge

Relational database approaches to time series data have significant performance and scalability limitations:

-- Traditional relational time series design - inefficient and complex

-- PostgreSQL time series approach with partitioning
CREATE TABLE sensor_readings (
    reading_id BIGSERIAL,
    sensor_id VARCHAR(100) NOT NULL,
    device_id VARCHAR(100) NOT NULL,
    location VARCHAR(200),
    timestamp TIMESTAMP NOT NULL,
    temperature DECIMAL(5,2),
    humidity DECIMAL(5,2),
    pressure DECIMAL(7,2),
    battery_level DECIMAL(3,2),
    signal_strength INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) PARTITION BY RANGE (timestamp);

-- Create monthly partitions (manual maintenance required)
CREATE TABLE sensor_readings_2024_01 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE sensor_readings_2024_02 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE sensor_readings_2024_03 PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- ... manual partition creation for each month

-- Indexes for time series queries
CREATE INDEX idx_sensor_readings_timestamp ON sensor_readings (timestamp);
CREATE INDEX idx_sensor_readings_sensor_id_timestamp ON sensor_readings (sensor_id, timestamp);
CREATE INDEX idx_sensor_readings_device_timestamp ON sensor_readings (device_id, timestamp);

-- Complex time series aggregation query
SELECT 
    sensor_id,
    device_id,
    DATE_TRUNC('hour', timestamp) as hour_bucket,

    -- Statistical aggregations
    COUNT(*) as reading_count,
    AVG(temperature) as avg_temperature,
    MIN(temperature) as min_temperature,
    MAX(temperature) as max_temperature,
    STDDEV(temperature) as temp_stddev,

    AVG(humidity) as avg_humidity,
    AVG(pressure) as avg_pressure,
    AVG(battery_level) as avg_battery,

    -- Time-based calculations
    FIRST_VALUE(temperature) OVER (
        PARTITION BY sensor_id, DATE_TRUNC('hour', timestamp) 
        ORDER BY timestamp
    ) as first_temp,
    LAST_VALUE(temperature) OVER (
        PARTITION BY sensor_id, DATE_TRUNC('hour', timestamp) 
        ORDER BY timestamp 
        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as last_temp,

    -- Lag calculations for trends
    LAG(AVG(temperature)) OVER (
        PARTITION BY sensor_id 
        ORDER BY DATE_TRUNC('hour', timestamp)
    ) as prev_hour_avg_temp

FROM sensor_readings
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND sensor_id IN ('TEMP_001', 'TEMP_002', 'TEMP_003')
GROUP BY sensor_id, device_id, DATE_TRUNC('hour', timestamp)
ORDER BY sensor_id, hour_bucket;

-- Problems with traditional time series approach:
-- 1. Manual partition management and maintenance overhead
-- 2. Poor compression ratios for time-stamped data
-- 3. Complex query patterns for time-based aggregations
-- 4. Limited scalability for high-frequency data ingestion
-- 5. Inefficient storage for sparse or irregular time series
-- 6. Difficult downsampling and data retention management
-- 7. Poor performance for cross-time-range analytics
-- 8. Complex indexing strategies for temporal queries

-- InfluxDB-style approach (specialized but limited)
-- INSERT INTO sensor_data,sensor_id=TEMP_001,device_id=DEV_001,location=warehouse_A 
--   temperature=23.5,humidity=65.2,pressure=1013.25,battery_level=85.3 1640995200000000000

-- InfluxDB limitations:
-- - Specialized query language (InfluxQL/Flux) not SQL compatible
-- - Limited JOIN capabilities across measurements
-- - Complex data modeling for hierarchical sensor networks
-- - Difficult integration with existing application stacks
-- - Limited support for complex business logic
-- - Vendor lock-in with proprietary tools and ecosystem
-- - Complex migration paths from existing SQL-based systems

MongoDB Time Series Collections provide comprehensive time series capabilities:

// MongoDB Time Series Collections - purpose-built for temporal data
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('iot_platform');

// Create time series collection with automatic optimization
const createTimeSeriesCollection = async () => {
  try {
    // Create time series collection with comprehensive configuration
    const collection = await db.createCollection('sensor_readings', {
      timeseries: {
        // Time field - required, used for automatic partitioning
        timeField: 'timestamp',

        // Meta field - optional, groups related time series together
        metaField: 'metadata',

        // Granularity for automatic bucketing and compression
        granularity: 'minutes', // 'seconds', 'minutes', 'hours'

        // Automatic expiration for data retention
        expireAfterSeconds: 60 * 60 * 24 * 365 // 1 year retention
      }
    });

    console.log('Time series collection created successfully');
    return collection;

  } catch (error) {
    console.error('Error creating time series collection:', error);
    throw error;
  }
};

// High-performance time series data ingestion
const ingestSensorData = async () => {
  const sensorReadings = db.collection('sensor_readings');

  // Batch insert for optimal performance
  const batchData = [];
  const batchSize = 1000;
  const currentTime = new Date();

  // Generate realistic IoT sensor data
  for (let i = 0; i < batchSize; i++) {
    const timestamp = new Date(currentTime.getTime() - (i * 60000)); // Every minute

    // Multiple sensors per batch
    ['TEMP_001', 'TEMP_002', 'TEMP_003', 'HUM_001', 'PRESS_001'].forEach(sensorId => {
      batchData.push({
        // Time field (required for time series)
        timestamp: timestamp,

        // Metadata field - groups related measurements
        metadata: {
          sensorId: sensorId,
          deviceId: sensorId.startsWith('TEMP') ? 'CLIMATE_DEV_001' : 
                   sensorId.startsWith('HUM') ? 'CLIMATE_DEV_001' : 'PRESSURE_DEV_001',
          location: {
            building: 'Warehouse_A',
            floor: 1,
            room: 'Storage_Room_1',
            coordinates: {
              x: Math.floor(Math.random() * 100),
              y: Math.floor(Math.random() * 100)
            }
          },
          sensorType: sensorId.startsWith('TEMP') ? 'temperature' :
                     sensorId.startsWith('HUM') ? 'humidity' : 'pressure',
          unit: sensorId.startsWith('TEMP') ? 'celsius' :
                sensorId.startsWith('HUM') ? 'percent' : 'hPa',
          calibrationDate: new Date('2024-01-01'),
          firmwareVersion: '2.1.3'
        },

        // Measurement data - varies by sensor type
        measurements: generateMeasurements(sensorId, timestamp),

        // System metadata
        ingestionTime: new Date(),
        dataQuality: {
          isValid: Math.random() > 0.02, // 2% invalid readings
          confidence: 0.95 + (Math.random() * 0.05), // 95-100% confidence
          calibrationStatus: 'valid',
          lastCalibration: new Date('2024-01-01')
        },

        // Device health metrics
        deviceHealth: {
          batteryLevel: 85 + Math.random() * 15, // 85-100%
          signalStrength: -30 - Math.random() * 40, // -30 to -70 dBm
          temperature: 20 + Math.random() * 10, // Device temp 20-30°C
          uptime: Math.floor(Math.random() * 86400 * 30) // Up to 30 days
        }
      });
    });
  }

  // Batch insert for optimal ingestion performance
  try {
    const result = await sensorReadings.insertMany(batchData, { 
      ordered: false, // Allow parallel insertions
      writeConcern: { w: 1 } // Optimize for ingestion speed
    });

    console.log(`Inserted ${result.insertedCount} sensor readings`);
    return result;

  } catch (error) {
    console.error('Error inserting sensor data:', error);
    throw error;
  }
};

function generateMeasurements(sensorId, timestamp) {
  const baseValues = {
    'TEMP_001': { value: 22, variance: 5 },
    'TEMP_002': { value: 24, variance: 3 },
    'TEMP_003': { value: 20, variance: 4 },
    'HUM_001': { value: 65, variance: 15 },
    'PRESS_001': { value: 1013.25, variance: 5 }
  };

  const base = baseValues[sensorId];
  if (!base) return {};

  // Add some realistic patterns and noise
  const hourOfDay = timestamp.getHours();
  const seasonalEffect = Math.sin((timestamp.getMonth() * Math.PI) / 6) * 2;
  const dailyEffect = Math.sin((hourOfDay * Math.PI) / 12) * 1.5;
  const randomNoise = (Math.random() - 0.5) * base.variance;

  const value = base.value + seasonalEffect + dailyEffect + randomNoise;

  return {
    value: Math.round(value * 100) / 100,
    rawValue: value,
    processed: true,

    // Statistical context
    range: {
      min: base.value - base.variance,
      max: base.value + base.variance
    },

    // Quality indicators
    outlierScore: Math.abs(randomNoise) / base.variance,
    trend: dailyEffect > 0 ? 'increasing' : 'decreasing'
  };
}

// Advanced time series queries and analytics
const performTimeSeriesAnalytics = async () => {
  const sensorReadings = db.collection('sensor_readings');

  // 1. Real-time dashboard data - last 24 hours
  const realtimeDashboard = await sensorReadings.aggregate([
    // Filter to last 24 hours
    {
      $match: {
        timestamp: {
          $gte: new Date(Date.now() - 24 * 60 * 60 * 1000)
        },
        'dataQuality.isValid': true
      }
    },

    // Group by sensor and time bucket for aggregation
    {
      $group: {
        _id: {
          sensorId: '$metadata.sensorId',
          sensorType: '$metadata.sensorType',
          location: '$metadata.location.room',
          // 15-minute time buckets
          timeBucket: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'minute',
              binSize: 15
            }
          }
        },

        // Statistical aggregations
        count: { $sum: 1 },
        avgValue: { $avg: '$measurements.value' },
        minValue: { $min: '$measurements.value' },
        maxValue: { $max: '$measurements.value' },
        stdDev: { $stdDevPop: '$measurements.value' },

        // First and last readings in bucket
        firstReading: { $first: '$measurements.value' },
        lastReading: { $last: '$measurements.value' },

        // Data quality metrics
        validReadings: {
          $sum: { $cond: ['$dataQuality.isValid', 1, 0] }
        },
        avgConfidence: { $avg: '$dataQuality.confidence' },

        // Device health aggregations
        avgBatteryLevel: { $avg: '$deviceHealth.batteryLevel' },
        avgSignalStrength: { $avg: '$deviceHealth.signalStrength' }
      }
    },

    // Calculate derived metrics
    {
      $addFields: {
        // Value change within bucket
        valueChange: { $subtract: ['$lastReading', '$firstReading'] },

        // Coefficient of variation (relative variability)
        coefficientOfVariation: {
          $cond: {
            if: { $ne: ['$avgValue', 0] },
            then: { $divide: ['$stdDev', '$avgValue'] },
            else: 0
          }
        },

        // Data quality ratio
        dataQualityRatio: { $divide: ['$validReadings', '$count'] },

        // Device health status
        deviceHealthStatus: {
          $switch: {
            branches: [
              {
                case: { 
                  $and: [
                    { $gte: ['$avgBatteryLevel', 80] },
                    { $gte: ['$avgSignalStrength', -50] }
                  ]
                },
                then: 'excellent'
              },
              {
                case: { 
                  $and: [
                    { $gte: ['$avgBatteryLevel', 50] },
                    { $gte: ['$avgSignalStrength', -65] }
                  ]
                },
                then: 'good'
              },
              {
                case: { 
                  $or: [
                    { $lt: ['$avgBatteryLevel', 20] },
                    { $lt: ['$avgSignalStrength', -80] }
                  ]
                },
                then: 'critical'
              }
            ],
            default: 'warning'
          }
        }
      }
    },

    // Sort by sensor and time
    {
      $sort: {
        '_id.sensorId': 1,
        '_id.timeBucket': 1
      }
    },

    // Format output for dashboard consumption
    {
      $group: {
        _id: '$_id.sensorId',
        sensorType: { $first: '$_id.sensorType' },
        location: { $first: '$_id.location' },

        // Time series data points
        timeSeries: {
          $push: {
            timestamp: '$_id.timeBucket',
            value: '$avgValue',
            min: '$minValue',
            max: '$maxValue',
            count: '$count',
            quality: '$dataQualityRatio',
            deviceHealth: '$deviceHealthStatus'
          }
        },

        // Aggregate statistics across all time buckets
        overallStats: {
          $push: {
            avg: '$avgValue',
            stdDev: '$stdDev',
            cv: '$coefficientOfVariation'
          }
        },

        // Latest values
        latestValue: { $last: '$avgValue' },
        latestChange: { $last: '$valueChange' },
        latestQuality: { $last: '$dataQualityRatio' }
      }
    },

    // Calculate final sensor-level statistics
    {
      $addFields: {
        overallAvg: { $avg: '$overallStats.avg' },
        overallStdDev: { $avg: '$overallStats.stdDev' },
        avgCV: { $avg: '$overallStats.cv' },

        // Trend analysis
        trend: {
          $cond: {
            if: { $gt: ['$latestChange', 0.1] },
            then: 'increasing',
            else: {
              $cond: {
                if: { $lt: ['$latestChange', -0.1] },
                then: 'decreasing',
                else: 'stable'
              }
            }
          }
        }
      }
    }
  ]).toArray();

  console.log('Real-time dashboard data:', JSON.stringify(realtimeDashboard, null, 2));

  // 2. Anomaly detection using statistical methods
  const anomalyDetection = await sensorReadings.aggregate([
    {
      $match: {
        timestamp: {
          $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) // Last 7 days
        }
      }
    },

    // Calculate rolling statistics for anomaly detection
    {
      $setWindowFields: {
        partitionBy: '$metadata.sensorId',
        sortBy: { timestamp: 1 },
        output: {
          // Rolling 30-point average and standard deviation
          rollingAvg: {
            $avg: '$measurements.value',
            window: {
              documents: [-15, 15] // 30-point centered window
            }
          },
          rollingStdDev: {
            $stdDevPop: '$measurements.value',
            window: {
              documents: [-15, 15]
            }
          },

          // Previous values for change detection
          prevValue: {
            $first: '$measurements.value',
            window: {
              documents: [-1, -1]
            }
          }
        }
      }
    },

    // Identify anomalies using statistical thresholds
    {
      $addFields: {
        // Z-score calculation
        zScore: {
          $cond: {
            if: { $ne: ['$rollingStdDev', 0] },
            then: {
              $divide: [
                { $subtract: ['$measurements.value', '$rollingAvg'] },
                '$rollingStdDev'
              ]
            },
            else: 0
          }
        },

        // Rate of change
        rateOfChange: {
          $cond: {
            if: { $and: ['$prevValue', { $ne: ['$prevValue', 0] }] },
            then: {
              $divide: [
                { $subtract: ['$measurements.value', '$prevValue'] },
                '$prevValue'
              ]
            },
            else: 0
          }
        }
      }
    },

    // Filter to potential anomalies
    {
      $match: {
        $or: [
          { zScore: { $gt: 3 } }, // Values > 3 standard deviations
          { zScore: { $lt: -3 } },
          { rateOfChange: { $gt: 0.5 } }, // > 50% change
          { rateOfChange: { $lt: -0.5 } }
        ]
      }
    },

    // Classify anomaly types
    {
      $addFields: {
        anomalyType: {
          $switch: {
            branches: [
              {
                case: { $gt: ['$zScore', 3] },
                then: 'statistical_high'
              },
              {
                case: { $lt: ['$zScore', -3] },
                then: 'statistical_low'
              },
              {
                case: { $gt: ['$rateOfChange', 0.5] },
                then: 'rapid_increase'
              },
              {
                case: { $lt: ['$rateOfChange', -0.5] },
                then: 'rapid_decrease'
              }
            ],
            default: 'unknown'
          }
        },

        anomalySeverity: {
          $switch: {
            branches: [
              {
                case: { 
                  $or: [
                    { $gt: ['$zScore', 5] },
                    { $lt: ['$zScore', -5] }
                  ]
                },
                then: 'critical'
              },
              {
                case: { 
                  $or: [
                    { $gt: ['$zScore', 4] },
                    { $lt: ['$zScore', -4] }
                  ]
                },
                then: 'high'
              }
            ],
            default: 'medium'
          }
        }
      }
    },

    // Group anomalies by sensor and type
    {
      $group: {
        _id: {
          sensorId: '$metadata.sensorId',
          anomalyType: '$anomalyType'
        },
        count: { $sum: 1 },
        avgSeverity: { $avg: '$zScore' },
        latestAnomaly: { $max: '$timestamp' },
        anomalies: {
          $push: {
            timestamp: '$timestamp',
            value: '$measurements.value',
            zScore: '$zScore',
            rateOfChange: '$rateOfChange',
            severity: '$anomalySeverity'
          }
        }
      }
    },

    {
      $sort: {
        '_id.sensorId': 1,
        count: -1
      }
    }
  ]).toArray();

  console.log('Anomaly detection results:', JSON.stringify(anomalyDetection, null, 2));

  // 3. Predictive maintenance analysis
  const predictiveMaintenance = await sensorReadings.aggregate([
    {
      $match: {
        timestamp: {
          $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) // Last 30 days
        }
      }
    },

    // Calculate device health trends
    {
      $group: {
        _id: {
          deviceId: '$metadata.deviceId',
          day: {
            $dateTrunc: {
              date: '$timestamp',
              unit: 'day'
            }
          }
        },

        avgBatteryLevel: { $avg: '$deviceHealth.batteryLevel' },
        avgSignalStrength: { $avg: '$deviceHealth.signalStrength' },
        readingCount: { $sum: 1 },
        errorRate: {
          $avg: { $cond: ['$dataQuality.isValid', 0, 1] }
        }
      }
    },

    // Calculate trends using linear regression approximation
    {
      $setWindowFields: {
        partitionBy: '$_id.deviceId',
        sortBy: { '_id.day': 1 },
        output: {
          batteryTrend: {
            $linearFill: '$avgBatteryLevel'
          },
          signalTrend: {
            $linearFill: '$avgSignalStrength'
          }
        }
      }
    },

    // Predict maintenance needs
    {
      $addFields: {
        batteryDaysRemaining: {
          $cond: {
            if: { $lt: ['$batteryTrend', 0] },
            then: {
              $ceil: {
                $divide: ['$avgBatteryLevel', { $abs: '$batteryTrend' }]
              }
            },
            else: 365 // Battery not declining
          }
        },

        maintenanceRisk: {
          $switch: {
            branches: [
              {
                case: {
                  $or: [
                    { $lt: ['$avgBatteryLevel', 20] },
                    { $gt: ['$errorRate', 0.1] }
                  ]
                },
                then: 'immediate'
              },
              {
                case: {
                  $or: [
                    { $lt: ['$avgBatteryLevel', 40] },
                    { $lt: ['$avgSignalStrength', -70] }
                  ]
                },
                then: 'high'
              },
              {
                case: { $lt: ['$avgBatteryLevel', 60] },
                then: 'medium'
              }
            ],
            default: 'low'
          }
        }
      }
    },

    // Group by device with latest status
    {
      $group: {
        _id: '$_id.deviceId',
        latestBatteryLevel: { $last: '$avgBatteryLevel' },
        latestSignalStrength: { $last: '$avgSignalStrength' },
        batteryTrend: { $last: '$batteryTrend' },
        signalTrend: { $last: '$signalTrend' },
        estimatedBatteryDays: { $last: '$batteryDaysRemaining' },
        maintenanceRisk: { $last: '$maintenanceRisk' },
        avgErrorRate: { $avg: '$errorRate' }
      }
    },

    {
      $sort: {
        maintenanceRisk: 1, // immediate first
        estimatedBatteryDays: 1
      }
    }
  ]).toArray();

  console.log('Predictive maintenance analysis:', JSON.stringify(predictiveMaintenance, null, 2));

  return {
    realtimeDashboard,
    anomalyDetection,
    predictiveMaintenance
  };
};

// Benefits of MongoDB Time Series Collections:
// - Automatic data partitioning and compression optimized for time-based data
// - Built-in retention policies with automatic expiration
// - Optimized indexes and query patterns for temporal analytics
// - High-performance ingestion with automatic bucketing
// - Native aggregation framework support for complex time series analysis
// - Flexible schema evolution for changing IoT device requirements
// - Horizontal scaling across sharded clusters
// - Integration with existing MongoDB ecosystem and tools
// - Real-time analytics with change streams for live dashboards
// - Cost-effective storage with intelligent compression algorithms

Understanding MongoDB Time Series Architecture

Time Series Collection Design Patterns

Implement comprehensive time series patterns for different IoT scenarios:

// Advanced time series collection design patterns
class IoTTimeSeriesManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
    this.ingestionBuffers = new Map();
  }

  async createIoTTimeSeriesCollections() {
    // Pattern 1: High-frequency sensor data
    const highFrequencyConfig = {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'sensor',
        granularity: 'seconds', // For sub-minute data
        bucketMaxSpanSeconds: 3600, // 1-hour buckets
        bucketRoundingSeconds: 60 // Round to minute boundaries
      },
      expireAfterSeconds: 60 * 60 * 24 * 30 // 30 days retention
    };

    const highFrequencySensors = await this.db.createCollection(
      'high_frequency_sensors', 
      highFrequencyConfig
    );

    // Pattern 2: Environmental monitoring (medium frequency)
    const environmentalConfig = {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'location',
        granularity: 'minutes',
        bucketMaxSpanSeconds: 86400, // 24-hour buckets
        bucketRoundingSeconds: 3600 // Round to hour boundaries
      },
      expireAfterSeconds: 60 * 60 * 24 * 365 // 1 year retention
    };

    const environmentalData = await this.db.createCollection(
      'environmental_monitoring',
      environmentalConfig
    );

    // Pattern 3: Device health metrics (low frequency)
    const deviceHealthConfig = {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'device',
        granularity: 'hours',
        bucketMaxSpanSeconds: 86400 * 7, // Weekly buckets
        bucketRoundingSeconds: 86400 // Round to day boundaries
      },
      expireAfterSeconds: 60 * 60 * 24 * 365 * 5 // 5 years retention
    };

    const deviceHealth = await this.db.createCollection(
      'device_health_metrics',
      deviceHealthConfig
    );

    // Pattern 4: Event-based time series (irregular intervals)
    const eventBasedConfig = {
      timeseries: {
        timeField: 'timestamp',
        metaField: 'eventSource',
        granularity: 'minutes' // Flexible for irregular events
      },
      expireAfterSeconds: 60 * 60 * 24 * 90 // 90 days retention
    };

    const eventTimeSeries = await this.db.createCollection(
      'event_time_series',
      eventBasedConfig
    );

    // Store collection references
    this.collections.set('highFrequency', highFrequencySensors);
    this.collections.set('environmental', environmentalData);
    this.collections.set('deviceHealth', deviceHealth);
    this.collections.set('events', eventTimeSeries);

    console.log('Time series collections created successfully');
    return this.collections;
  }

  async setupOptimalIndexes() {
    // Create compound indexes for common query patterns
    for (const [name, collection] of this.collections.entries()) {
      try {
        // Metadata + time range queries
        await collection.createIndex({
          'sensor.id': 1,
          'timestamp': 1
        });

        // Location-based queries
        await collection.createIndex({
          'sensor.location.building': 1,
          'sensor.location.floor': 1,
          'timestamp': 1
        });

        // Device type queries
        await collection.createIndex({
          'sensor.type': 1,
          'timestamp': 1
        });

        // Data quality queries
        await collection.createIndex({
          'quality.isValid': 1,
          'timestamp': 1
        });

        console.log(`Indexes created for ${name} collection`);

      } catch (error) {
        console.error(`Error creating indexes for ${name}:`, error);
      }
    }
  }

  async ingestHighFrequencyData(sensorData) {
    // High-performance ingestion with batching
    const collection = this.collections.get('highFrequency');
    const batchSize = 10000;
    const batches = [];

    // Prepare optimized document structure
    const documents = sensorData.map(reading => ({
      timestamp: new Date(reading.timestamp),

      // Metadata field - groups related time series
      sensor: {
        id: reading.sensorId,
        type: reading.sensorType,
        model: reading.model || 'Unknown',
        location: {
          building: reading.building,
          floor: reading.floor,
          room: reading.room,
          coordinates: reading.coordinates
        },
        specifications: {
          accuracy: reading.accuracy,
          range: reading.range,
          units: reading.units
        }
      },

      // Measurements - optimized for compression
      temp: reading.temperature,
      hum: reading.humidity,
      press: reading.pressure,

      // Device status
      batt: reading.batteryLevel,
      signal: reading.signalStrength,

      // Data quality indicators
      quality: {
        isValid: reading.isValid !== false,
        confidence: reading.confidence || 1.0,
        source: reading.source || 'sensor'
      }
    }));

    // Split into batches for optimal ingestion
    for (let i = 0; i < documents.length; i += batchSize) {
      batches.push(documents.slice(i, i + batchSize));
    }

    // Parallel batch ingestion
    const ingestionPromises = batches.map(async (batch, index) => {
      try {
        const result = await collection.insertMany(batch, {
          ordered: false,
          writeConcern: { w: 1 }
        });

        console.log(`Batch ${index + 1}: Inserted ${result.insertedCount} documents`);
        return result.insertedCount;

      } catch (error) {
        console.error(`Batch ${index + 1} failed:`, error);
        return 0;
      }
    });

    const results = await Promise.all(ingestionPromises);
    const totalInserted = results.reduce((sum, count) => sum + count, 0);

    console.log(`Total documents inserted: ${totalInserted}`);
    return totalInserted;
  }

  async performRealTimeAnalytics(timeRange = '1h', sensorIds = []) {
    const collection = this.collections.get('highFrequency');

    // Calculate time range
    const timeRangeMs = {
      '15m': 15 * 60 * 1000,
      '1h': 60 * 60 * 1000,
      '6h': 6 * 60 * 60 * 1000,
      '24h': 24 * 60 * 60 * 1000
    };

    const startTime = new Date(Date.now() - timeRangeMs[timeRange]);

    const pipeline = [
      // Time range and sensor filtering
      {
        $match: {
          timestamp: { $gte: startTime },
          ...(sensorIds.length > 0 && { 'sensor.id': { $in: sensorIds } }),
          'quality.isValid': true
        }
      },

      // Time-based bucketing for aggregation
      {
        $group: {
          _id: {
            sensorId: '$sensor.id',
            sensorType: '$sensor.type',
            location: '$sensor.location.room',
            // Dynamic time bucketing based on range
            timeBucket: {
              $dateTrunc: {
                date: '$timestamp',
                unit: 'minute',
                binSize: timeRange === '15m' ? 1 : 
                        timeRange === '1h' ? 5 : 
                        timeRange === '6h' ? 15 : 60
              }
            }
          },

          // Statistical aggregations
          count: { $sum: 1 },

          // Temperature metrics
          tempAvg: { $avg: '$temp' },
          tempMin: { $min: '$temp' },
          tempMax: { $max: '$temp' },
          tempStdDev: { $stdDevPop: '$temp' },

          // Humidity metrics
          humAvg: { $avg: '$hum' },
          humMin: { $min: '$hum' },
          humMax: { $max: '$hum' },

          // Pressure metrics
          pressAvg: { $avg: '$press' },
          pressMin: { $min: '$press' },
          pressMax: { $max: '$press' },

          // Device health metrics
          battAvg: { $avg: '$batt' },
          battMin: { $min: '$batt' },
          signalAvg: { $avg: '$signal' },
          signalMin: { $min: '$signal' },

          // Data quality metrics
          validReadings: { $sum: 1 },
          avgConfidence: { $avg: '$quality.confidence' },

          // First and last values for trend calculation
          firstTemp: { $first: '$temp' },
          lastTemp: { $last: '$temp' },
          firstTimestamp: { $first: '$timestamp' },
          lastTimestamp: { $last: '$timestamp' }
        }
      },

      // Calculate derived metrics
      {
        $addFields: {
          // Temperature trends
          tempTrend: { $subtract: ['$lastTemp', '$firstTemp'] },
          tempCV: {
            $cond: {
              if: { $ne: ['$tempAvg', 0] },
              then: { $divide: ['$tempStdDev', '$tempAvg'] },
              else: 0
            }
          },

          // Time span for rate calculations
          timeSpanMinutes: {
            $divide: [
              { $subtract: ['$lastTimestamp', '$firstTimestamp'] },
              60000
            ]
          },

          // Device health status
          deviceStatus: {
            $switch: {
              branches: [
                {
                  case: { 
                    $and: [
                      { $gte: ['$battAvg', 80] },
                      { $gte: ['$signalAvg', -50] }
                    ]
                  },
                  then: 'excellent'
                },
                {
                  case: {
                    $and: [
                      { $gte: ['$battAvg', 50] },
                      { $gte: ['$signalAvg', -65] }
                    ]
                  },
                  then: 'good'
                },
                {
                  case: {
                    $or: [
                      { $lt: ['$battAvg', 20] },
                      { $lt: ['$signalAvg', -80] }
                    ]
                  },
                  then: 'critical'
                }
              ],
              default: 'warning'
            }
          }
        }
      },

      // Sort for time series presentation
      {
        $sort: {
          '_id.sensorId': 1,
          '_id.timeBucket': 1
        }
      },

      // Format for dashboard consumption
      {
        $group: {
          _id: '$_id.sensorId',
          sensorType: { $first: '$_id.sensorType' },
          location: { $first: '$_id.location' },

          // Time series data
          timeSeries: {
            $push: {
              timestamp: '$_id.timeBucket',
              temperature: {
                avg: '$tempAvg',
                min: '$tempMin',
                max: '$tempMax',
                trend: '$tempTrend',
                cv: '$tempCV'
              },
              humidity: {
                avg: '$humAvg',
                min: '$humMin',
                max: '$humMax'
              },
              pressure: {
                avg: '$pressAvg',
                min: '$pressMin',
                max: '$pressMax'
              },
              deviceHealth: {
                battery: '$battAvg',
                signal: '$signalAvg',
                status: '$deviceStatus'
              },
              dataQuality: {
                readingCount: '$count',
                confidence: '$avgConfidence'
              }
            }
          },

          // Summary statistics
          summaryStats: {
            totalReadings: { $sum: '$count' },
            avgTemperature: { $avg: '$tempAvg' },
            temperatureRange: {
              $subtract: [{ $max: '$tempMax' }, { $min: '$tempMin' }]
            },
            overallDeviceStatus: { $last: '$deviceStatus' }
          }
        }
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    // Add metadata about the query
    return {
      timeRange: timeRange,
      queryTime: new Date(),
      startTime: startTime,
      endTime: new Date(),
      sensorCount: results.length,
      data: results
    };
  }

  async detectAnomaliesAdvanced(sensorId, lookbackHours = 168) { // 1 week default
    const collection = this.collections.get('highFrequency');
    const lookbackTime = new Date(Date.now() - lookbackHours * 60 * 60 * 1000);

    const pipeline = [
      {
        $match: {
          'sensor.id': sensorId,
          timestamp: { $gte: lookbackTime },
          'quality.isValid': true
        }
      },

      { $sort: { timestamp: 1 } },

      // Calculate rolling statistics using window functions
      {
        $setWindowFields: {
          sortBy: { timestamp: 1 },
          output: {
            // Rolling 50-point statistics for anomaly detection
            rollingMean: {
              $avg: '$temp',
              window: { documents: [-25, 25] }
            },
            rollingStd: {
              $stdDevPop: '$temp',
              window: { documents: [-25, 25] }
            },

            // Seasonal decomposition (24-hour pattern)
            dailyMean: {
              $avg: '$temp',
              window: { range: [-12, 12], unit: 'hour' }
            },

            // Trend analysis
            trendSlope: {
              $linearFill: '$temp'
            },

            // Previous values for rate of change
            prevTemp: {
              $first: '$temp',
              window: { documents: [-1, -1] }
            }
          }
        }
      },

      // Calculate anomaly scores
      {
        $addFields: {
          // Z-score anomaly detection
          zScore: {
            $cond: {
              if: { $ne: ['$rollingStd', 0] },
              then: {
                $divide: [
                  { $subtract: ['$temp', '$rollingMean'] },
                  '$rollingStd'
                ]
              },
              else: 0
            }
          },

          // Seasonal anomaly (deviation from daily pattern)
          seasonalAnomaly: {
            $cond: {
              if: { $ne: ['$dailyMean', 0] },
              then: {
                $abs: {
                  $divide: [
                    { $subtract: ['$temp', '$dailyMean'] },
                    '$dailyMean'
                  ]
                }
              },
              else: 0
            }
          },

          // Rate of change anomaly
          rateOfChange: {
            $cond: {
              if: { $and: ['$prevTemp', { $ne: ['$prevTemp', 0] }] },
              then: {
                $abs: {
                  $divide: [
                    { $subtract: ['$temp', '$prevTemp'] },
                    '$prevTemp'
                  ]
                }
              },
              else: 0
            }
          }
        }
      },

      // Identify anomalies using multiple criteria
      {
        $addFields: {
          isAnomaly: {
            $or: [
              { $gt: [{ $abs: '$zScore' }, 3] }, // Statistical outlier
              { $gt: ['$seasonalAnomaly', 0.3] }, // 30% deviation from seasonal
              { $gt: ['$rateOfChange', 0.5] } // 50% rate of change
            ]
          },

          anomalyType: {
            $switch: {
              branches: [
                {
                  case: { $gt: ['$zScore', 3] },
                  then: 'statistical_high'
                },
                {
                  case: { $lt: ['$zScore', -3] },
                  then: 'statistical_low'
                },
                {
                  case: { $gt: ['$seasonalAnomaly', 0.3] },
                  then: 'seasonal_deviation'
                },
                {
                  case: { $gt: ['$rateOfChange', 0.5] },
                  then: 'rapid_change'
                }
              ],
              default: 'normal'
            }
          },

          anomalySeverity: {
            $switch: {
              branches: [
                {
                  case: { $gt: [{ $abs: '$zScore' }, 5] },
                  then: 'critical'
                },
                {
                  case: { $gt: [{ $abs: '$zScore' }, 4] },
                  then: 'high'
                },
                {
                  case: { $gt: [{ $abs: '$zScore' }, 3] },
                  then: 'medium'
                }
              ],
              default: 'low'
            }
          }
        }
      },

      // Filter to anomalies only
      { $match: { isAnomaly: true } },

      // Group consecutive anomalies into events
      {
        $group: {
          _id: {
            $dateToString: {
              format: '%Y-%m-%d-%H',
              date: '$timestamp'
            }
          },

          anomalyCount: { $sum: 1 },
          avgSeverityScore: { $avg: { $abs: '$zScore' } },

          anomalies: {
            $push: {
              timestamp: '$timestamp',
              value: '$temp',
              zScore: '$zScore',
              type: '$anomalyType',
              severity: '$anomalySeverity',
              seasonalDeviation: '$seasonalAnomaly',
              rateOfChange: '$rateOfChange'
            }
          },

          startTime: { $min: '$timestamp' },
          endTime: { $max: '$timestamp' }
        }
      },

      { $sort: { startTime: -1 } }
    ];

    return await collection.aggregate(pipeline).toArray();
  }

  async generatePerformanceReports(reportType = 'daily') {
    const collection = this.collections.get('highFrequency');

    // Calculate report time range
    const timeRanges = {
      'hourly': 60 * 60 * 1000,
      'daily': 24 * 60 * 60 * 1000,
      'weekly': 7 * 24 * 60 * 60 * 1000,
      'monthly': 30 * 24 * 60 * 60 * 1000
    };

    const startTime = new Date(Date.now() - timeRanges[reportType]);

    const pipeline = [
      {
        $match: {
          timestamp: { $gte: startTime }
        }
      },

      // Group by sensor and time period
      {
        $group: {
          _id: {
            sensorId: '$sensor.id',
            sensorType: '$sensor.type',
            location: '$sensor.location',
            period: {
              $dateTrunc: {
                date: '$timestamp',
                unit: reportType === 'hourly' ? 'hour' :
                      reportType === 'daily' ? 'day' :
                      reportType === 'weekly' ? 'week' : 'month'
              }
            }
          },

          // Data volume metrics
          totalReadings: { $sum: 1 },
          validReadings: {
            $sum: { $cond: ['$quality.isValid', 1, 0] }
          },

          // Data quality metrics
          avgConfidence: { $avg: '$quality.confidence' },
          dataQualityRatio: {
            $avg: { $cond: ['$quality.isValid', 1, 0] }
          },

          // Measurement statistics
          tempStats: {
            $push: {
              avg: { $avg: '$temp' },
              min: { $min: '$temp' },
              max: { $max: '$temp' },
              stdDev: { $stdDevPop: '$temp' }
            }
          },

          // Device health metrics
          avgBatteryLevel: { $avg: '$batt' },
          minBatteryLevel: { $min: '$batt' },
          avgSignalStrength: { $avg: '$signal' },
          minSignalStrength: { $min: '$signal' },

          // Time coverage
          firstReading: { $min: '$timestamp' },
          lastReading: { $max: '$timestamp' }
        }
      },

      // Calculate performance indicators
      {
        $addFields: {
          // Coverage percentage
          coveragePercentage: {
            $multiply: [
              {
                $divide: [
                  { $subtract: ['$lastReading', '$firstReading'] },
                  timeRanges[reportType]
                ]
              },
              100
            ]
          },

          // Device health score
          deviceHealthScore: {
            $multiply: [
              {
                $add: [
                  { $divide: ['$avgBatteryLevel', 100] }, // Battery factor
                  { $divide: [{ $add: ['$avgSignalStrength', 100] }, 50] } // Signal factor
                ]
              },
              50
            ]
          },

          // Overall performance score
          performanceScore: {
            $multiply: [
              {
                $add: [
                  { $multiply: ['$dataQualityRatio', 0.4] },
                  { $multiply: [{ $divide: ['$avgConfidence', 1] }, 0.3] },
                  { $multiply: [{ $divide: ['$avgBatteryLevel', 100] }, 0.2] },
                  { $multiply: [{ $divide: [{ $add: ['$avgSignalStrength', 100] }, 50] }, 0.1] }
                ]
              },
              100
            ]
          }
        }
      },

      // Generate recommendations
      {
        $addFields: {
          recommendations: {
            $switch: {
              branches: [
                {
                  case: { $lt: ['$dataQualityRatio', 0.9] },
                  then: ['Investigate data quality issues', 'Check sensor calibration']
                },
                {
                  case: { $lt: ['$avgBatteryLevel', 30] },
                  then: ['Schedule battery replacement', 'Consider solar charging']
                },
                {
                  case: { $lt: ['$avgSignalStrength', -75] },
                  then: ['Check network connectivity', 'Consider signal boosters']
                },
                {
                  case: { $lt: ['$coveragePercentage', 95] },
                  then: ['Investigate data gaps', 'Check device uptime']
                }
              ],
              default: ['Performance within normal parameters']
            }
          },

          alertLevel: {
            $switch: {
              branches: [
                {
                  case: { $lt: ['$performanceScore', 60] },
                  then: 'critical'
                },
                {
                  case: { $lt: ['$performanceScore', 80] },
                  then: 'warning'
                }
              ],
              default: 'normal'
            }
          }
        }
      },

      {
        $sort: {
          performanceScore: 1, // Lowest scores first
          '_id.sensorId': 1
        }
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    return {
      reportType: reportType,
      generatedAt: new Date(),
      timeRange: {
        start: startTime,
        end: new Date()
      },
      summary: {
        totalSensors: results.length,
        criticalAlerts: results.filter(r => r.alertLevel === 'critical').length,
        warnings: results.filter(r => r.alertLevel === 'warning').length,
        avgPerformanceScore: results.reduce((sum, r) => sum + r.performanceScore, 0) / results.length
      },
      sensorReports: results
    };
  }
}

SQL-Style Time Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Time Series operations:

-- QueryLeaf time series operations with SQL-familiar syntax

-- Create time series collection
CREATE TIME_SERIES COLLECTION sensor_readings (
  timestamp TIMESTAMP NOT NULL, -- time field
  sensor_id VARCHAR(100) NOT NULL,
  location VARCHAR(200),
  device_id VARCHAR(100),

  -- Measurements
  temperature DECIMAL(5,2),
  humidity DECIMAL(5,2),
  pressure DECIMAL(7,2),

  -- Device health
  battery_level DECIMAL(3,2),
  signal_strength INTEGER,

  -- Data quality
  is_valid BOOLEAN DEFAULT true,
  confidence DECIMAL(3,2) DEFAULT 1.00
) WITH (
  meta_field = 'sensor_metadata',
  granularity = 'minutes',
  expire_after_seconds = 2678400 -- 31 days
);

-- High-performance batch insert for IoT data
INSERT INTO sensor_readings 
VALUES 
  ('2024-09-17 10:00:00', 'TEMP_001', 'Warehouse_A', 'DEV_001', 23.5, 65.2, 1013.25, 85.3, -45, true, 0.98),
  ('2024-09-17 10:01:00', 'TEMP_001', 'Warehouse_A', 'DEV_001', 23.7, 65.0, 1013.30, 85.2, -46, true, 0.97),
  ('2024-09-17 10:02:00', 'TEMP_001', 'Warehouse_A', 'DEV_001', 23.6, 64.8, 1013.28, 85.1, -44, true, 0.99);

-- Real-time dashboard query with time bucketing
SELECT 
  sensor_id,
  location,
  TIME_BUCKET('15 minutes', timestamp) as time_bucket,

  -- Statistical aggregations
  COUNT(*) as reading_count,
  AVG(temperature) as avg_temperature,
  MIN(temperature) as min_temperature,
  MAX(temperature) as max_temperature,
  STDDEV_POP(temperature) as temp_stddev,

  AVG(humidity) as avg_humidity,
  AVG(pressure) as avg_pressure,

  -- Device health metrics
  AVG(battery_level) as avg_battery,
  MIN(battery_level) as min_battery,
  AVG(signal_strength) as avg_signal,

  -- Data quality metrics
  SUM(CASE WHEN is_valid THEN 1 ELSE 0 END) as valid_readings,
  AVG(confidence) as avg_confidence,

  -- Trend indicators
  FIRST_VALUE(temperature ORDER BY timestamp) as first_temp,
  LAST_VALUE(temperature ORDER BY timestamp) as last_temp,
  LAST_VALUE(temperature ORDER BY timestamp) - FIRST_VALUE(temperature ORDER BY timestamp) as temp_change

FROM sensor_readings
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  AND sensor_id IN ('TEMP_001', 'TEMP_002', 'TEMP_003')
  AND is_valid = true
GROUP BY sensor_id, location, TIME_BUCKET('15 minutes', timestamp)
ORDER BY sensor_id, time_bucket;

-- Advanced anomaly detection with window functions
WITH statistical_baseline AS (
  SELECT 
    sensor_id,
    timestamp,
    temperature,

    -- Rolling statistics for anomaly detection
    AVG(temperature) OVER (
      PARTITION BY sensor_id
      ORDER BY timestamp
      ROWS BETWEEN 25 PRECEDING AND 25 FOLLOWING
    ) as rolling_avg,

    STDDEV_POP(temperature) OVER (
      PARTITION BY sensor_id  
      ORDER BY timestamp
      ROWS BETWEEN 25 PRECEDING AND 25 FOLLOWING
    ) as rolling_stddev,

    -- Seasonal baseline (same hour of day pattern)
    AVG(temperature) OVER (
      PARTITION BY sensor_id, EXTRACT(hour FROM timestamp)
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL '7 days' PRECEDING AND INTERVAL '7 days' FOLLOWING
    ) as seasonal_avg,

    -- Previous value for rate of change
    LAG(temperature, 1) OVER (
      PARTITION BY sensor_id 
      ORDER BY timestamp
    ) as prev_temperature

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    AND is_valid = true
),
anomaly_scores AS (
  SELECT *,
    -- Z-score calculation
    CASE 
      WHEN rolling_stddev > 0 THEN (temperature - rolling_avg) / rolling_stddev
      ELSE 0 
    END as z_score,

    -- Seasonal deviation
    ABS(temperature - seasonal_avg) / GREATEST(seasonal_avg, 0.1) as seasonal_deviation,

    -- Rate of change
    CASE 
      WHEN prev_temperature IS NOT NULL AND prev_temperature != 0 
      THEN ABS(temperature - prev_temperature) / ABS(prev_temperature)
      ELSE 0 
    END as rate_of_change

  FROM statistical_baseline
),
classified_anomalies AS (
  SELECT *,
    -- Anomaly classification
    CASE
      WHEN ABS(z_score) > 3 OR seasonal_deviation > 0.3 OR rate_of_change > 0.5 THEN true
      ELSE false
    END as is_anomaly,

    CASE 
      WHEN z_score > 3 THEN 'statistical_high'
      WHEN z_score < -3 THEN 'statistical_low'
      WHEN seasonal_deviation > 0.3 THEN 'seasonal_deviation'
      WHEN rate_of_change > 0.5 THEN 'rapid_change'
      ELSE 'normal'
    END as anomaly_type,

    CASE
      WHEN ABS(z_score) > 5 THEN 'critical'
      WHEN ABS(z_score) > 4 THEN 'high'
      WHEN ABS(z_score) > 3 THEN 'medium'
      ELSE 'low'
    END as severity

  FROM anomaly_scores
)
SELECT 
  sensor_id,
  DATE_TRUNC('hour', timestamp) as anomaly_hour,
  COUNT(*) as anomaly_count,
  AVG(ABS(z_score)) as avg_severity_score,

  -- Anomaly details
  json_agg(
    json_build_object(
      'timestamp', timestamp,
      'temperature', temperature,
      'z_score', ROUND(z_score::numeric, 3),
      'type', anomaly_type,
      'severity', severity
    ) ORDER BY timestamp
  ) as anomalies,

  MIN(timestamp) as first_anomaly,
  MAX(timestamp) as last_anomaly

FROM classified_anomalies
WHERE is_anomaly = true
GROUP BY sensor_id, DATE_TRUNC('hour', timestamp)
ORDER BY sensor_id, anomaly_hour DESC;

-- Predictive maintenance analysis
WITH device_health_trends AS (
  SELECT 
    device_id,
    sensor_id,
    DATE_TRUNC('day', timestamp) as day,

    AVG(battery_level) as daily_battery_avg,
    MIN(battery_level) as daily_battery_min,
    AVG(signal_strength) as daily_signal_avg,
    MIN(signal_strength) as daily_signal_min,
    COUNT(*) as daily_reading_count,

    -- Data quality metrics
    AVG(CASE WHEN is_valid THEN 1.0 ELSE 0.0 END) as data_quality_ratio,
    AVG(confidence) as avg_confidence

  FROM sensor_readings
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  GROUP BY device_id, sensor_id, DATE_TRUNC('day', timestamp)
),
trend_analysis AS (
  SELECT *,
    -- Linear trend approximation using least squares
    REGR_SLOPE(daily_battery_avg, EXTRACT(epoch FROM day)) * 86400 as battery_daily_slope,
    REGR_SLOPE(daily_signal_avg, EXTRACT(epoch FROM day)) * 86400 as signal_daily_slope,

    -- Device health scoring
    (daily_battery_avg * 0.4 + 
     (daily_signal_avg + 100) / 50 * 100 * 0.3 +
     data_quality_ratio * 100 * 0.3) as health_score

  FROM device_health_trends
),
maintenance_predictions AS (
  SELECT 
    device_id,

    -- Latest status
    LAST_VALUE(daily_battery_avg ORDER BY day) as current_battery,
    LAST_VALUE(daily_signal_avg ORDER BY day) as current_signal,
    LAST_VALUE(data_quality_ratio ORDER BY day) as current_quality,
    LAST_VALUE(health_score ORDER BY day) as current_health_score,

    -- Trends
    AVG(battery_daily_slope) as battery_trend,
    AVG(signal_daily_slope) as signal_trend,

    -- Predictions
    CASE 
      WHEN AVG(battery_daily_slope) < -0.5 THEN 
        CEIL(LAST_VALUE(daily_battery_avg ORDER BY day) / ABS(AVG(battery_daily_slope)))
      ELSE 365 
    END as estimated_battery_days,

    -- Risk assessment
    CASE
      WHEN LAST_VALUE(daily_battery_avg ORDER BY day) < 20 OR 
           LAST_VALUE(data_quality_ratio ORDER BY day) < 0.8 THEN 'immediate'
      WHEN LAST_VALUE(daily_battery_avg ORDER BY day) < 40 OR 
           LAST_VALUE(daily_signal_avg ORDER BY day) < -70 THEN 'high'
      WHEN LAST_VALUE(daily_battery_avg ORDER BY day) < 60 THEN 'medium'
      ELSE 'low'
    END as maintenance_risk,

    COUNT(*) as days_monitored

  FROM trend_analysis
  GROUP BY device_id
)
SELECT 
  device_id,
  ROUND(current_battery, 1) as battery_level,
  ROUND(current_signal, 1) as signal_strength,
  ROUND(current_quality * 100, 1) as data_quality_pct,
  ROUND(current_health_score, 1) as health_score,

  -- Trends
  CASE 
    WHEN battery_trend < -0.1 THEN 'declining'
    WHEN battery_trend > 0.1 THEN 'improving'
    ELSE 'stable'
  END as battery_trend_status,

  estimated_battery_days,
  maintenance_risk,

  -- Recommendations
  CASE maintenance_risk
    WHEN 'immediate' THEN 'Schedule maintenance within 24 hours'
    WHEN 'high' THEN 'Schedule maintenance within 1 week'  
    WHEN 'medium' THEN 'Schedule maintenance within 1 month'
    ELSE 'Monitor normal schedule'
  END as recommendation,

  days_monitored

FROM maintenance_predictions
ORDER BY 
  CASE maintenance_risk
    WHEN 'immediate' THEN 1
    WHEN 'high' THEN 2
    WHEN 'medium' THEN 3
    ELSE 4
  END,
  estimated_battery_days ASC;

-- Time series downsampling and data retention
CREATE MATERIALIZED VIEW hourly_sensor_summary AS
SELECT 
  sensor_id,
  location,
  device_id,
  TIME_BUCKET('1 hour', timestamp) as hour_bucket,

  -- Statistical summaries
  COUNT(*) as reading_count,
  AVG(temperature) as avg_temperature,
  MIN(temperature) as min_temperature,  
  MAX(temperature) as max_temperature,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) as median_temperature,
  STDDEV_POP(temperature) as temp_stddev,

  AVG(humidity) as avg_humidity,
  AVG(pressure) as avg_pressure,

  -- Device health summaries
  AVG(battery_level) as avg_battery,
  MIN(battery_level) as min_battery,
  AVG(signal_strength) as avg_signal,

  -- Quality metrics
  AVG(CASE WHEN is_valid THEN 1.0 ELSE 0.0 END) as data_quality,
  AVG(confidence) as avg_confidence,

  -- Time range
  MIN(timestamp) as period_start,
  MAX(timestamp) as period_end

FROM sensor_readings
WHERE is_valid = true
GROUP BY sensor_id, location, device_id, TIME_BUCKET('1 hour', timestamp);

-- Performance monitoring and optimization
WITH collection_stats AS (
  SELECT 
    'sensor_readings' as collection_name,
    COUNT(*) as total_documents,

    -- Time range analysis
    MIN(timestamp) as earliest_data,
    MAX(timestamp) as latest_data,
    MAX(timestamp) - MIN(timestamp) as time_span,

    -- Data volume analysis  
    COUNT(*) / EXTRACT(days FROM (MAX(timestamp) - MIN(timestamp))) as avg_docs_per_day,

    -- Quality metrics
    AVG(CASE WHEN is_valid THEN 1.0 ELSE 0.0 END) as overall_quality,
    COUNT(DISTINCT sensor_id) as unique_sensors,
    COUNT(DISTINCT device_id) as unique_devices

  FROM sensor_readings
),
performance_metrics AS (
  SELECT 
    cs.*,

    -- Storage efficiency estimates
    total_documents * 200 as estimated_storage_bytes, -- Rough estimate

    -- Query performance indicators
    CASE 
      WHEN avg_docs_per_day > 100000 THEN 'high_volume'
      WHEN avg_docs_per_day > 10000 THEN 'medium_volume'
      ELSE 'low_volume'
    END as volume_category,

    -- Recommendations
    CASE
      WHEN overall_quality < 0.9 THEN 'Review data validation and sensor calibration'
      WHEN avg_docs_per_day > 100000 THEN 'Consider additional indexing and archiving strategy'
      WHEN time_span > INTERVAL '6 months' THEN 'Implement data lifecycle management'
      ELSE 'Performance within normal parameters'
    END as recommendation

  FROM collection_stats cs
)
SELECT 
  collection_name,
  total_documents,
  TO_CHAR(earliest_data, 'YYYY-MM-DD HH24:MI') as data_start,
  TO_CHAR(latest_data, 'YYYY-MM-DD HH24:MI') as data_end,
  EXTRACT(days FROM time_span) as retention_days,
  ROUND(avg_docs_per_day::numeric, 0) as daily_ingestion_rate,
  ROUND(overall_quality * 100, 1) as quality_percentage,
  unique_sensors,
  unique_devices,
  volume_category,
  ROUND(estimated_storage_bytes / 1024.0 / 1024.0, 1) as estimated_storage_mb,
  recommendation
FROM performance_metrics;

-- QueryLeaf provides comprehensive time series capabilities:
-- 1. SQL-familiar time series collection creation and management
-- 2. High-performance batch data ingestion optimized for IoT workloads  
-- 3. Advanced time bucketing and statistical aggregations
-- 4. Sophisticated anomaly detection using multiple algorithms
-- 5. Predictive maintenance analysis with trend forecasting
-- 6. Automatic data lifecycle management and retention policies
-- 7. Performance monitoring and optimization recommendations
-- 8. Integration with MongoDB's native time series optimizations
-- 9. Real-time analytics with materialized view support
-- 10. Familiar SQL syntax for complex temporal queries and analysis

Best Practices for Time Series Implementation

Data Modeling and Schema Design

Essential practices for optimal time series performance:

Granularity Selection: Choose appropriate time granularity based on data frequency and query patterns
Metadata Organization: Structure metadata fields to optimize automatic bucketing and compression
Measurement Optimization: Use efficient data types and avoid deep nesting for measurements
Index Strategy: Create compound indexes supporting common time range and metadata queries
Retention Policies: Implement automatic expiration aligned with business requirements
Batch Ingestion: Use bulk operations for high-throughput IoT data ingestion

Performance and Scalability

Optimize time series collections for high-performance analytics:

Bucket Sizing: Configure bucket parameters for optimal compression and query performance
Query Optimization: Leverage time series specific aggregation patterns and operators
Resource Planning: Size clusters appropriately for expected data volumes and query loads
Archival Strategy: Implement data lifecycle management with cold storage integration
Monitoring Setup: Track collection performance and optimize based on usage patterns
Downsampling: Use materialized views and pre-aggregated summaries for historical analysis

Conclusion

MongoDB Time Series Collections provide purpose-built capabilities for IoT data management and temporal analytics that eliminate the complexity and limitations of traditional relational approaches. The integration of automatic compression, optimized indexing, and specialized query patterns makes building high-performance time series applications both powerful and efficient.

Key Time Series benefits include:

Purpose-Built Storage: Automatic partitioning and compression optimized for temporal data
High-Performance Ingestion: Optimized for high-frequency IoT data streams
Advanced Analytics: Native support for complex time-based aggregations and window functions
Automatic Lifecycle: Built-in retention policies and data expiration management
Scalable Architecture: Horizontal scaling across sharded clusters for massive datasets
Developer Familiar: SQL-style query patterns with specialized time series operations

Whether you're building IoT monitoring platforms, sensor networks, financial trading systems, or applications requiring time-based analytics, MongoDB Time Series Collections with QueryLeaf's familiar SQL interface provides the foundation for modern temporal data management. This combination enables you to implement sophisticated time series capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Time Series Collections while providing SQL-familiar time bucketing, statistical aggregations, and temporal analytics. Advanced time series features, anomaly detection, and performance optimization are seamlessly handled through familiar SQL patterns, making high-performance time series analytics both powerful and accessible.

The integration of native time series capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both sophisticated temporal analytics and familiar database interaction patterns, ensuring your time series solutions remain both effective and maintainable as they scale and evolve.

September 16, 2025
21 min read

MongoDB Change Streams and Real-Time Data Processing: SQL-Style Event-Driven Architecture for Reactive Applications

Modern applications require real-time responsiveness to data changes - instant notifications, live dashboards, automatic workflow triggers, and synchronized data across distributed systems. Traditional approaches of polling databases for changes create significant performance overhead, introduce latency delays, and consume unnecessary resources while missing the precision and immediacy that users expect from contemporary applications.

MongoDB Change Streams provide enterprise-grade real-time data processing capabilities that monitor database changes as they occur, delivering instant event notifications with complete change context, ordering guarantees, and resumability features. Unlike polling-based approaches or complex trigger systems, Change Streams integrate seamlessly with application architectures to enable reactive programming patterns and event-driven workflows.

The Traditional Change Detection Challenge

Conventional approaches to detecting data changes have significant limitations for real-time applications:

-- Traditional polling approach - inefficient and high-latency
-- Application repeatedly queries database for changes

-- PostgreSQL change detection with polling
CREATE TABLE user_activities (
    activity_id SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL,
    activity_type VARCHAR(100) NOT NULL,
    activity_data JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP,
    is_processed BOOLEAN DEFAULT false
);

-- Trigger to update timestamp on changes
CREATE OR REPLACE FUNCTION update_updated_at_column()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = CURRENT_TIMESTAMP;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_user_activities_updated_at
    BEFORE UPDATE ON user_activities
    FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();

-- Application polling for changes (inefficient)
-- This query runs continuously every few seconds
SELECT 
    activity_id,
    user_id,
    activity_type,
    activity_data,
    created_at,
    updated_at
FROM user_activities 
WHERE (updated_at > @last_poll_time OR created_at > @last_poll_time)
  AND is_processed = false
ORDER BY created_at, updated_at
LIMIT 1000;

-- Update processed records
UPDATE user_activities 
SET is_processed = true, processed_at = CURRENT_TIMESTAMP
WHERE activity_id IN (@processed_ids);

-- Problems with polling approach:
-- 1. High database load from constant polling queries
-- 2. Polling frequency vs. latency tradeoff (faster polling = more load)
-- 3. Potential race conditions with concurrent processors
-- 4. No ordering guarantees across multiple tables
-- 5. Missed changes during application downtime
-- 6. Complex state management for resuming processing
-- 7. Difficult to scale across multiple application instances
-- 8. Resource waste during periods of no activity

-- Database triggers approach - limited and fragile
CREATE OR REPLACE FUNCTION notify_change()
RETURNS TRIGGER AS $$
BEGIN
    -- Limited payload size in PostgreSQL notifications
    PERFORM pg_notify(
        'user_activity_change',
        json_build_object(
            'operation', TG_OP,
            'table', TG_TABLE_NAME,
            'id', COALESCE(NEW.activity_id, OLD.activity_id)
        )::text
    );

    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER user_activities_change_trigger
    AFTER INSERT OR UPDATE OR DELETE ON user_activities
    FOR EACH ROW EXECUTE FUNCTION notify_change();

-- Application listening for notifications
-- Limited payload, no automatic reconnection, fragile connections
LISTEN user_activity_change;

-- Trigger limitations:
-- - Limited payload size (8000 bytes in PostgreSQL)
-- - Connection-based, not resilient to network issues  
-- - No built-in resume capability after disconnection
-- - Complex coordination across multiple database connections
-- - Difficult to filter events at database level
-- - No ordering guarantees across transactions
-- - Performance impact on write operations

MongoDB Change Streams provide comprehensive real-time change processing:

// MongoDB Change Streams - enterprise-grade real-time data processing
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('production_app');

// Comprehensive change stream with advanced filtering and processing
async function setupAdvancedChangeStream() {
  // Create change stream with sophisticated pipeline filtering
  const changeStream = db.collection('user_activities').watch([
    // Match specific operations and conditions
    {
      $match: {
        $and: [
          // Only monitor insert and update operations
          { operationType: { $in: ['insert', 'update', 'replace'] } },

          // Filter by activity types we care about
          {
            $or: [
              { 'fullDocument.activity_type': { $in: ['purchase', 'login', 'signup'] } },
              { 'updateDescription.updatedFields.status': { $exists: true } },
              { 'fullDocument.priority': 'high' }
            ]
          },

          // Only process activities for active users
          { 'fullDocument.user_status': 'active' },

          // Exclude system-generated activities
          { 'fullDocument.source': { $ne: 'system_maintenance' } }
        ]
      }
    },

    // Enrich change events with additional context
    {
      $lookup: {
        from: 'users',
        localField: 'fullDocument.user_id',
        foreignField: '_id',
        as: 'user_info'
      }
    },

    // Add computed fields for processing
    {
      $addFields: {
        processedAt: new Date(),
        changeId: { $toString: '$_id' },
        user: { $arrayElemAt: ['$user_info', 0] },

        // Categorize change types
        changeCategory: {
          $switch: {
            branches: [
              { case: { $eq: ['$operationType', 'insert'] }, then: 'new_activity' },
              { 
                case: { 
                  $and: [
                    { $eq: ['$operationType', 'update'] },
                    { $ifNull: ['$updateDescription.updatedFields.status', false] }
                  ]
                }, 
                then: 'status_change' 
              },
              { case: { $eq: ['$operationType', 'replace'] }, then: 'activity_replaced' }
            ],
            default: 'other_change'
          }
        },

        // Priority scoring
        priorityScore: {
          $switch: {
            branches: [
              { case: { $eq: ['$fullDocument.activity_type', 'purchase'] }, then: 10 },
              { case: { $eq: ['$fullDocument.activity_type', 'signup'] }, then: 8 },
              { case: { $eq: ['$fullDocument.activity_type', 'login'] }, then: 3 },
              { case: { $eq: ['$fullDocument.priority', 'high'] }, then: 9 }
            ],
            default: 5
          }
        }
      }
    },

    // Project final change document structure
    {
      $project: {
        changeId: 1,
        operationType: 1,
        changeCategory: 1,
        priorityScore: 1,
        processedAt: 1,
        clusterTime: 1,

        // Original document data
        documentKey: 1,
        fullDocument: 1,
        updateDescription: 1,

        // User context
        'user.username': 1,
        'user.email': 1,
        'user.subscription_type': 1,
        'user.segment': 1,

        // Metadata
        ns: 1,
        to: 1
      }
    }
  ], {
    // Change stream options
    fullDocument: 'updateLookup',        // Always include full document
    fullDocumentBeforeChange: 'whenAvailable', // Include before-change document
    resumeAfter: null,                   // Resume token (set from previous session)
    startAtOperationTime: null,          // Start from specific time
    maxAwaitTimeMS: 1000,               // Maximum time to wait for changes
    batchSize: 100,                      // Batch size for change events
    collation: { locale: 'en', strength: 2 } // Collation for text matching
  });

  // Process change stream events
  console.log('Monitoring user activities for real-time changes...');

  for await (const change of changeStream) {
    try {
      await processChangeEvent(change);

      // Store resume token for fault tolerance
      await storeResumeToken(change._id);

    } catch (error) {
      console.error('Error processing change event:', error);

      // Implement error handling strategy
      await handleChangeProcessingError(change, error);
    }
  }
}

// Sophisticated change event processing
async function processChangeEvent(change) {
  console.log(`Processing ${change.changeCategory} event:`, {
    changeId: change.changeId,
    operationType: change.operationType,
    priority: change.priorityScore,
    user: change.user?.username,
    timestamp: change.processedAt
  });

  // Route change events based on type and priority
  switch (change.changeCategory) {
    case 'new_activity':
      await handleNewActivity(change);
      break;

    case 'status_change':
      await handleStatusChange(change);
      break;

    case 'activity_replaced':
      await handleActivityReplacement(change);
      break;

    default:
      await handleGenericChange(change);
  }

  // Emit real-time event to connected clients
  await emitRealTimeEvent(change);

  // Update analytics and metrics
  await updateRealtimeMetrics(change);
}

async function handleNewActivity(change) {
  const activity = change.fullDocument;
  const user = change.user;

  // Process high-priority activities immediately
  if (change.priorityScore >= 8) {
    await processHighPriorityActivity(activity, user);
  }

  // Trigger automated workflows
  switch (activity.activity_type) {
    case 'purchase':
      await triggerPurchaseWorkflow(activity, user);
      break;

    case 'signup':
      await triggerOnboardingWorkflow(activity, user);
      break;

    case 'login':
      await updateUserSession(activity, user);
      break;
  }

  // Update real-time dashboards
  await updateLiveDashboard('new_activity', {
    activityType: activity.activity_type,
    userId: activity.user_id,
    userSegment: user.segment,
    timestamp: activity.created_at
  });
}

async function handleStatusChange(change) {
  const updatedFields = change.updateDescription.updatedFields;
  const activity = change.fullDocument;

  // Process status-specific logic
  if (updatedFields.status) {
    console.log(`Activity status changed: ${updatedFields.status}`);

    switch (updatedFields.status) {
      case 'completed':
        await handleActivityCompletion(activity);
        break;

      case 'failed':
        await handleActivityFailure(activity);
        break;

      case 'cancelled':
        await handleActivityCancellation(activity);
        break;
    }
  }

  // Notify interested parties
  await sendStatusChangeNotification(change);
}

// Benefits of MongoDB Change Streams:
// - Real-time event delivery with sub-second latency
// - Complete change context including before/after state
// - Resumable streams with automatic fault tolerance
// - Advanced filtering and transformation capabilities
// - Ordering guarantees within and across collections
// - Integration with existing MongoDB infrastructure
// - Scalable across sharded clusters and replica sets
// - Built-in authentication and authorization
// - No polling overhead or resource waste
// - Developer-friendly API with powerful aggregation pipeline

Understanding MongoDB Change Streams Architecture

Advanced Change Stream Configuration and Management

Implement comprehensive change stream management for production environments:

// Advanced change stream management system
class MongoChangeStreamManager {
  constructor(client, options = {}) {
    this.client = client;
    this.db = client.db(options.database || 'production');
    this.options = {
      // Stream configuration
      maxRetries: options.maxRetries || 10,
      retryDelay: options.retryDelay || 1000,
      batchSize: options.batchSize || 100,
      maxAwaitTimeMS: options.maxAwaitTimeMS || 1000,

      // Resume configuration
      enableResume: options.enableResume !== false,
      resumeTokenStorage: options.resumeTokenStorage || 'mongodb',

      // Error handling
      errorRetryStrategies: options.errorRetryStrategies || ['exponential_backoff', 'circuit_breaker'],

      // Monitoring
      enableMetrics: options.enableMetrics !== false,
      metricsInterval: options.metricsInterval || 30000,

      ...options
    };

    this.activeStreams = new Map();
    this.resumeTokens = new Map();
    this.streamMetrics = new Map();
    this.eventHandlers = new Map();
    this.isShuttingDown = false;
  }

  async createChangeStream(streamConfig) {
    const {
      streamId,
      collection,
      pipeline = [],
      options = {},
      eventHandlers = {}
    } = streamConfig;

    if (this.activeStreams.has(streamId)) {
      throw new Error(`Change stream with ID '${streamId}' already exists`);
    }

    // Build comprehensive change stream pipeline
    const changeStreamPipeline = [
      // Base filtering
      {
        $match: {
          $and: [
            // Operation type filtering
            streamConfig.operationTypes ? {
              operationType: { $in: streamConfig.operationTypes }
            } : {},

            // Namespace filtering
            streamConfig.namespaces ? {
              'ns.coll': { $in: streamConfig.namespaces.map(ns => ns.collection || ns) }
            } : {},

            // Custom filtering
            ...(streamConfig.filters || [])
          ].filter(filter => Object.keys(filter).length > 0)
        }
      },

      // Enrichment lookups
      ...(streamConfig.enrichments || []).map(enrichment => ({
        $lookup: {
          from: enrichment.from,
          localField: enrichment.localField,
          foreignField: enrichment.foreignField,
          as: enrichment.as,
          pipeline: enrichment.pipeline || []
        }
      })),

      // Computed fields
      {
        $addFields: {
          streamId: streamId,
          processedAt: new Date(),
          changeId: { $toString: '$_id' },

          // Change categorization
          changeCategory: streamConfig.categorization || {
            $switch: {
              branches: [
                { case: { $eq: ['$operationType', 'insert'] }, then: 'create' },
                { case: { $eq: ['$operationType', 'update'] }, then: 'update' },
                { case: { $eq: ['$operationType', 'replace'] }, then: 'replace' },
                { case: { $eq: ['$operationType', 'delete'] }, then: 'delete' }
              ],
              default: 'other'
            }
          },

          // Priority scoring
          priority: streamConfig.priorityScoring || 5,

          // Custom computed fields
          ...streamConfig.computedFields || {}
        }
      },

      // Additional pipeline stages
      ...pipeline,

      // Final projection
      {
        $project: {
          _id: 1,
          streamId: 1,
          changeId: 1,
          processedAt: 1,
          operationType: 1,
          changeCategory: 1,
          priority: 1,
          clusterTime: 1,
          documentKey: 1,
          fullDocument: 1,
          updateDescription: 1,
          ns: 1,
          to: 1,
          ...streamConfig.additionalProjection || {}
        }
      }
    ];

    // Configure change stream options
    const changeStreamOptions = {
      fullDocument: streamConfig.fullDocument || 'updateLookup',
      fullDocumentBeforeChange: streamConfig.fullDocumentBeforeChange || 'whenAvailable',
      resumeAfter: await this.getStoredResumeToken(streamId),
      maxAwaitTimeMS: this.options.maxAwaitTimeMS,
      batchSize: this.options.batchSize,
      ...options
    };

    // Create change stream
    const changeStream = collection ? 
      this.db.collection(collection).watch(changeStreamPipeline, changeStreamOptions) :
      this.db.watch(changeStreamPipeline, changeStreamOptions);

    // Store stream configuration
    this.activeStreams.set(streamId, {
      stream: changeStream,
      config: streamConfig,
      options: changeStreamOptions,
      createdAt: new Date(),
      lastEventAt: null,
      eventCount: 0,
      errorCount: 0,
      retryCount: 0
    });

    // Initialize metrics
    this.streamMetrics.set(streamId, {
      eventsProcessed: 0,
      errorsEncountered: 0,
      avgProcessingTime: 0,
      lastProcessingTime: 0,
      throughputHistory: [],
      errorHistory: [],
      resumeHistory: []
    });

    // Store event handlers
    this.eventHandlers.set(streamId, eventHandlers);

    // Start processing
    this.processChangeStream(streamId);

    console.log(`Change stream '${streamId}' created and started`);
    return streamId;
  }

  async processChangeStream(streamId) {
    const streamInfo = this.activeStreams.get(streamId);
    const metrics = this.streamMetrics.get(streamId);
    const handlers = this.eventHandlers.get(streamId);

    if (!streamInfo) {
      console.error(`Change stream '${streamId}' not found`);
      return;
    }

    const { stream, config } = streamInfo;

    try {
      console.log(`Starting event processing for stream: ${streamId}`);

      for await (const change of stream) {
        if (this.isShuttingDown) {
          console.log(`Shutting down stream: ${streamId}`);
          break;
        }

        const processingStartTime = Date.now();

        try {
          // Process the change event
          await this.processChangeEvent(streamId, change, handlers);

          // Update metrics
          const processingTime = Date.now() - processingStartTime;
          this.updateStreamMetrics(streamId, processingTime, true);

          // Store resume token
          await this.storeResumeToken(streamId, change._id);

          // Update stream info
          streamInfo.lastEventAt = new Date();
          streamInfo.eventCount++;

        } catch (error) {
          console.error(`Error processing change event in stream '${streamId}':`, error);

          // Update error metrics
          const processingTime = Date.now() - processingStartTime;
          this.updateStreamMetrics(streamId, processingTime, false);

          streamInfo.errorCount++;

          // Handle processing error
          await this.handleProcessingError(streamId, change, error);
        }
      }

    } catch (error) {
      console.error(`Change stream '${streamId}' encountered error:`, error);

      if (!this.isShuttingDown) {
        await this.handleStreamError(streamId, error);
      }
    }
  }

  async processChangeEvent(streamId, change, handlers) {
    // Route to appropriate handler based on change type
    const handlerKey = change.changeCategory || change.operationType;
    const handler = handlers[handlerKey] || handlers.default || this.defaultEventHandler;

    if (typeof handler === 'function') {
      await handler(change, {
        streamId,
        metrics: this.streamMetrics.get(streamId),
        resumeToken: change._id
      });
    } else {
      console.warn(`No handler found for change type '${handlerKey}' in stream '${streamId}'`);
    }
  }

  async defaultEventHandler(change, context) {
    console.log(`Default handler processing change:`, {
      streamId: context.streamId,
      changeId: change.changeId,
      operationType: change.operationType,
      collection: change.ns?.coll
    });
  }

  updateStreamMetrics(streamId, processingTime, success) {
    const metrics = this.streamMetrics.get(streamId);
    if (!metrics) return;

    metrics.eventsProcessed++;
    metrics.lastProcessingTime = processingTime;

    // Update average processing time (exponential moving average)
    metrics.avgProcessingTime = (metrics.avgProcessingTime * 0.9) + (processingTime * 0.1);

    if (success) {
      // Update throughput history
      metrics.throughputHistory.push({
        timestamp: Date.now(),
        processingTime: processingTime
      });

      // Keep only recent history
      if (metrics.throughputHistory.length > 1000) {
        metrics.throughputHistory.shift();
      }
    } else {
      metrics.errorsEncountered++;

      // Record error
      metrics.errorHistory.push({
        timestamp: Date.now(),
        processingTime: processingTime
      });

      // Keep only recent error history
      if (metrics.errorHistory.length > 100) {
        metrics.errorHistory.shift();
      }
    }
  }

  async handleProcessingError(streamId, change, error) {
    const streamInfo = this.activeStreams.get(streamId);
    const config = streamInfo?.config;

    // Log error details
    console.error(`Processing error in stream '${streamId}':`, {
      changeId: change.changeId,
      operationType: change.operationType,
      error: error.message
    });

    // Apply error handling strategies
    if (config?.errorHandling) {
      const strategy = config.errorHandling.strategy || 'log';

      switch (strategy) {
        case 'retry':
          await this.retryChangeEvent(streamId, change, error);
          break;

        case 'deadletter':
          await this.sendToDeadLetter(streamId, change, error);
          break;

        case 'skip':
          console.warn(`Skipping failed change event: ${change.changeId}`);
          break;

        case 'stop_stream':
          console.error(`Stopping stream '${streamId}' due to processing error`);
          await this.stopChangeStream(streamId);
          break;

        default:
          console.error(`Unhandled processing error in stream '${streamId}'`);
      }
    }
  }

  async handleStreamError(streamId, error) {
    const streamInfo = this.activeStreams.get(streamId);
    if (!streamInfo) return;

    console.error(`Stream error in '${streamId}':`, error.message);

    // Increment retry count
    streamInfo.retryCount++;

    // Check if we should retry
    if (streamInfo.retryCount <= this.options.maxRetries) {
      console.log(`Retrying stream '${streamId}' (attempt ${streamInfo.retryCount})`);

      // Exponential backoff
      const delay = this.options.retryDelay * Math.pow(2, streamInfo.retryCount - 1);
      await this.sleep(delay);

      // Record resume attempt
      const metrics = this.streamMetrics.get(streamId);
      if (metrics) {
        metrics.resumeHistory.push({
          timestamp: Date.now(),
          attempt: streamInfo.retryCount,
          error: error.message
        });
      }

      // Restart the stream
      await this.restartChangeStream(streamId);
    } else {
      console.error(`Maximum retries exceeded for stream '${streamId}'. Marking as failed.`);
      streamInfo.status = 'failed';
      streamInfo.lastError = error;
    }
  }

  async restartChangeStream(streamId) {
    const streamInfo = this.activeStreams.get(streamId);
    if (!streamInfo) return;

    console.log(`Restarting change stream: ${streamId}`);

    try {
      // Close existing stream
      await streamInfo.stream.close();
    } catch (closeError) {
      console.warn(`Error closing stream '${streamId}':`, closeError.message);
    }

    // Update stream options with resume token
    const resumeToken = await this.getStoredResumeToken(streamId);
    if (resumeToken) {
      streamInfo.options.resumeAfter = resumeToken;
      console.log(`Resuming stream '${streamId}' from stored token`);
    }

    // Create new change stream
    const changeStreamPipeline = streamInfo.config.pipeline || [];
    const newStream = streamInfo.config.collection ? 
      this.db.collection(streamInfo.config.collection).watch(changeStreamPipeline, streamInfo.options) :
      this.db.watch(changeStreamPipeline, streamInfo.options);

    // Update stream reference
    streamInfo.stream = newStream;
    streamInfo.restartedAt = new Date();

    // Resume processing
    this.processChangeStream(streamId);
  }

  async storeResumeToken(streamId, resumeToken) {
    if (!this.options.enableResume) return;

    this.resumeTokens.set(streamId, {
      token: resumeToken,
      timestamp: new Date()
    });

    // Store persistently based on configuration
    if (this.options.resumeTokenStorage === 'mongodb') {
      await this.db.collection('change_stream_resume_tokens').updateOne(
        { streamId: streamId },
        {
          $set: {
            resumeToken: resumeToken,
            updatedAt: new Date()
          }
        },
        { upsert: true }
      );
    } else if (this.options.resumeTokenStorage === 'redis' && this.redisClient) {
      await this.redisClient.set(
        `resume_token:${streamId}`,
        JSON.stringify({
          token: resumeToken,
          timestamp: new Date()
        })
      );
    }
  }

  async getStoredResumeToken(streamId) {
    if (!this.options.enableResume) return null;

    // Check memory first
    const memoryToken = this.resumeTokens.get(streamId);
    if (memoryToken) {
      return memoryToken.token;
    }

    // Load from persistent storage
    try {
      if (this.options.resumeTokenStorage === 'mongodb') {
        const tokenDoc = await this.db.collection('change_stream_resume_tokens').findOne(
          { streamId: streamId }
        );
        return tokenDoc?.resumeToken || null;
      } else if (this.options.resumeTokenStorage === 'redis' && this.redisClient) {
        const tokenData = await this.redisClient.get(`resume_token:${streamId}`);
        return tokenData ? JSON.parse(tokenData).token : null;
      }
    } catch (error) {
      console.warn(`Error loading resume token for stream '${streamId}':`, error.message);
    }

    return null;
  }

  async stopChangeStream(streamId) {
    const streamInfo = this.activeStreams.get(streamId);
    if (!streamInfo) {
      console.warn(`Change stream '${streamId}' not found`);
      return;
    }

    console.log(`Stopping change stream: ${streamId}`);

    try {
      await streamInfo.stream.close();
      streamInfo.stoppedAt = new Date();
      streamInfo.status = 'stopped';

      console.log(`Change stream '${streamId}' stopped successfully`);
    } catch (error) {
      console.error(`Error stopping stream '${streamId}':`, error);
    }
  }

  async getStreamMetrics(streamId) {
    if (streamId) {
      return {
        streamInfo: this.activeStreams.get(streamId),
        metrics: this.streamMetrics.get(streamId)
      };
    } else {
      // Return metrics for all streams
      const allMetrics = {};
      for (const [id, streamInfo] of this.activeStreams.entries()) {
        allMetrics[id] = {
          streamInfo: streamInfo,
          metrics: this.streamMetrics.get(id)
        };
      }
      return allMetrics;
    }
  }

  async startMonitoring() {
    if (this.monitoringInterval) return;

    console.log('Starting change stream monitoring');

    this.monitoringInterval = setInterval(async () => {
      try {
        await this.performHealthCheck();
      } catch (error) {
        console.error('Monitoring check failed:', error);
      }
    }, this.options.metricsInterval);
  }

  async performHealthCheck() {
    for (const [streamId, streamInfo] of this.activeStreams.entries()) {
      const metrics = this.streamMetrics.get(streamId);

      // Check stream health
      const health = this.assessStreamHealth(streamId, streamInfo, metrics);

      if (health.status !== 'healthy') {
        console.warn(`Stream '${streamId}' health check:`, health);
      }

      // Log throughput metrics
      if (metrics.throughputHistory.length > 0) {
        const recentEvents = metrics.throughputHistory.filter(
          event => Date.now() - event.timestamp < 60000 // Last minute
        );

        if (recentEvents.length > 0) {
          const avgThroughput = recentEvents.length; // Events per minute
          console.log(`Stream '${streamId}' throughput: ${avgThroughput} events/minute`);
        }
      }
    }
  }

  assessStreamHealth(streamId, streamInfo, metrics) {
    const health = {
      streamId: streamId,
      status: 'healthy',
      issues: [],
      recommendations: []
    };

    // Check error rate
    if (metrics.errorsEncountered > 0 && metrics.eventsProcessed > 0) {
      const errorRate = (metrics.errorsEncountered / metrics.eventsProcessed) * 100;
      if (errorRate > 10) {
        health.status = 'unhealthy';
        health.issues.push(`High error rate: ${errorRate.toFixed(2)}%`);
        health.recommendations.push('Investigate error patterns and processing logic');
      } else if (errorRate > 5) {
        health.status = 'warning';
        health.issues.push(`Elevated error rate: ${errorRate.toFixed(2)}%`);
      }
    }

    // Check processing performance
    if (metrics.avgProcessingTime > 5000) {
      health.issues.push(`Slow processing: ${metrics.avgProcessingTime.toFixed(0)}ms average`);
      health.recommendations.push('Optimize event processing logic');
      if (health.status === 'healthy') health.status = 'warning';
    }

    // Check stream activity
    const timeSinceLastEvent = streamInfo.lastEventAt ? 
      Date.now() - streamInfo.lastEventAt.getTime() : 
      Date.now() - streamInfo.createdAt.getTime();

    if (timeSinceLastEvent > 3600000) { // 1 hour
      health.issues.push(`No events for ${Math.round(timeSinceLastEvent / 60000)} minutes`);
      health.recommendations.push('Verify data source and stream configuration');
    }

    // Check retry count
    if (streamInfo.retryCount > 3) {
      health.issues.push(`Multiple retries: ${streamInfo.retryCount} attempts`);
      health.recommendations.push('Investigate connection stability and error causes');
      if (health.status === 'healthy') health.status = 'warning';
    }

    return health;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  async shutdown() {
    console.log('Shutting down change stream manager...');

    this.isShuttingDown = true;

    // Stop monitoring
    if (this.monitoringInterval) {
      clearInterval(this.monitoringInterval);
      this.monitoringInterval = null;
    }

    // Close all active streams
    const closePromises = [];
    for (const [streamId] of this.activeStreams.entries()) {
      closePromises.push(this.stopChangeStream(streamId));
    }

    await Promise.all(closePromises);

    console.log('Change stream manager shutdown complete');
  }
}

Real-Time Event Processing Patterns

Implement sophisticated event processing patterns for different application scenarios:

// Specialized change stream patterns for different use cases
class RealtimeEventPatterns {
  constructor(changeStreamManager) {
    this.csm = changeStreamManager;
    this.eventBus = new EventEmitter();
    this.processors = new Map();
  }

  async setupUserActivityStream() {
    // Real-time user activity monitoring
    return await this.csm.createChangeStream({
      streamId: 'user_activities',
      collection: 'user_activities',
      operationTypes: ['insert', 'update'],

      filters: [
        { 'fullDocument.activity_type': { $in: ['login', 'purchase', 'view', 'search'] } },
        { 'fullDocument.user_id': { $exists: true } }
      ],

      enrichments: [
        {
          from: 'users',
          localField: 'fullDocument.user_id',
          foreignField: '_id',
          as: 'user_data'
        },
        {
          from: 'user_sessions',
          localField: 'fullDocument.session_id',
          foreignField: '_id',
          as: 'session_data'
        }
      ],

      computedFields: {
        activityScore: {
          $switch: {
            branches: [
              { case: { $eq: ['$fullDocument.activity_type', 'purchase'] }, then: 100 },
              { case: { $eq: ['$fullDocument.activity_type', 'login'] }, then: 10 },
              { case: { $eq: ['$fullDocument.activity_type', 'search'] }, then: 5 },
              { case: { $eq: ['$fullDocument.activity_type', 'view'] }, then: 1 }
            ],
            default: 0
          }
        },

        userSegment: { $arrayElemAt: ['$user_data.segment', 0] },
        sessionDuration: { $arrayElemAt: ['$session_data.duration', 0] }
      },

      eventHandlers: {
        insert: async (change, context) => {
          await this.handleNewUserActivity(change);
        },
        update: async (change, context) => {
          await this.handleUserActivityUpdate(change);
        }
      },

      errorHandling: {
        strategy: 'retry',
        maxRetries: 3
      }
    });
  }

  async handleNewUserActivity(change) {
    const activity = change.fullDocument;
    const user = change.user_data?.[0];

    console.log(`New user activity: ${activity.activity_type}`, {
      userId: activity.user_id,
      username: user?.username,
      activityScore: change.activityScore,
      timestamp: activity.created_at
    });

    // Real-time user engagement tracking
    await this.updateUserEngagement(activity, user);

    // Trigger personalization engine
    if (change.activityScore >= 5) {
      await this.triggerPersonalizationUpdate(activity, user);
    }

    // Real-time recommendations
    if (activity.activity_type === 'view' || activity.activity_type === 'search') {
      await this.updateRecommendations(activity, user);
    }

    // Fraud detection for high-value activities
    if (activity.activity_type === 'purchase') {
      await this.analyzeFraudRisk(activity, user, change.session_data?.[0]);
    }

    // Live dashboard updates
    this.eventBus.emit('user_activity', {
      type: 'new_activity',
      activity: activity,
      user: user,
      score: change.activityScore
    });
  }

  async setupOrderProcessingStream() {
    // Real-time order processing and fulfillment
    return await this.csm.createChangeStream({
      streamId: 'order_processing',
      collection: 'orders',
      operationTypes: ['insert', 'update'],

      filters: [
        {
          $or: [
            { operationType: 'insert' },
            { 'updateDescription.updatedFields.status': { $exists: true } }
          ]
        }
      ],

      enrichments: [
        {
          from: 'customers',
          localField: 'fullDocument.customer_id',
          foreignField: '_id',
          as: 'customer_data'
        },
        {
          from: 'inventory',
          localField: 'fullDocument.items.product_id',
          foreignField: '_id',
          as: 'inventory_data'
        }
      ],

      computedFields: {
        orderValue: '$fullDocument.total_amount',
        orderPriority: {
          $switch: {
            branches: [
              { case: { $gt: ['$fullDocument.total_amount', 1000] }, then: 'high' },
              { case: { $gt: ['$fullDocument.total_amount', 500] }, then: 'medium' }
            ],
            default: 'normal'
          }
        },
        customerTier: { $arrayElemAt: ['$customer_data.tier', 0] }
      },

      eventHandlers: {
        insert: async (change, context) => {
          await this.handleNewOrder(change);
        },
        update: async (change, context) => {
          await this.handleOrderStatusChange(change);
        }
      }
    });
  }

  async handleNewOrder(change) {
    const order = change.fullDocument;
    const customer = change.customer_data?.[0];

    console.log(`New order received:`, {
      orderId: order._id,
      customerId: order.customer_id,
      customerTier: change.customerTier,
      orderValue: change.orderValue,
      priority: change.orderPriority
    });

    // Inventory allocation
    await this.allocateInventory(order, change.inventory_data);

    // Payment processing
    if (order.payment_method) {
      await this.processPayment(order, customer);
    }

    // Shipping calculation
    await this.calculateShipping(order, customer);

    // Notification systems
    await this.sendOrderConfirmation(order, customer);

    // Analytics and reporting
    this.eventBus.emit('new_order', {
      order: order,
      customer: customer,
      priority: change.orderPriority,
      value: change.orderValue
    });
  }

  async handleOrderStatusChange(change) {
    const updatedFields = change.updateDescription.updatedFields;
    const order = change.fullDocument;

    if (updatedFields.status) {
      console.log(`Order status changed: ${order._id} -> ${updatedFields.status}`);

      switch (updatedFields.status) {
        case 'confirmed':
          await this.handleOrderConfirmation(order);
          break;
        case 'shipped':
          await this.handleOrderShipment(order);
          break;
        case 'delivered':
          await this.handleOrderDelivery(order);
          break;
        case 'cancelled':
          await this.handleOrderCancellation(order);
          break;
      }

      // Customer notifications
      await this.sendStatusUpdateNotification(order, updatedFields.status);
    }
  }

  async setupInventoryManagementStream() {
    // Real-time inventory tracking and alerts
    return await this.csm.createChangeStream({
      streamId: 'inventory_management',
      collection: 'inventory',
      operationTypes: ['update'],

      filters: [
        {
          $or: [
            { 'updateDescription.updatedFields.quantity': { $exists: true } },
            { 'updateDescription.updatedFields.reserved_quantity': { $exists: true } },
            { 'updateDescription.updatedFields.available_quantity': { $exists: true } }
          ]
        }
      ],

      enrichments: [
        {
          from: 'products',
          localField: 'documentKey._id',
          foreignField: 'inventory_id',
          as: 'product_data'
        }
      ],

      computedFields: {
        stockLevel: '$fullDocument.available_quantity',
        reorderThreshold: '$fullDocument.reorder_level',
        stockStatus: {
          $cond: {
            if: { $lte: ['$fullDocument.available_quantity', '$fullDocument.reorder_level'] },
            then: 'low_stock',
            else: 'in_stock'
          }
        }
      },

      eventHandlers: {
        update: async (change, context) => {
          await this.handleInventoryChange(change);
        }
      }
    });
  }

  async handleInventoryChange(change) {
    const inventory = change.fullDocument;
    const updatedFields = change.updateDescription.updatedFields;
    const product = change.product_data?.[0];

    console.log(`Inventory updated:`, {
      productId: product?._id,
      productName: product?.name,
      previousQuantity: updatedFields.quantity,
      currentQuantity: inventory.available_quantity,
      stockStatus: change.stockStatus
    });

    // Low stock alerts
    if (change.stockStatus === 'low_stock') {
      await this.triggerLowStockAlert(inventory, product);
    }

    // Out of stock handling
    if (inventory.available_quantity <= 0) {
      await this.handleOutOfStock(inventory, product);
    }

    // Automatic reordering
    if (inventory.auto_reorder && inventory.available_quantity <= inventory.reorder_level) {
      await this.triggerAutomaticReorder(inventory, product);
    }

    // Live inventory dashboard
    this.eventBus.emit('inventory_change', {
      inventory: inventory,
      product: product,
      stockStatus: change.stockStatus,
      quantityChange: updatedFields.quantity ? 
        inventory.available_quantity - updatedFields.quantity : 0
    });
  }

  async setupMultiCollectionStream() {
    // Monitor changes across multiple collections
    return await this.csm.createChangeStream({
      streamId: 'multi_collection_monitor',
      operationTypes: ['insert', 'update', 'delete'],

      filters: [
        {
          'ns.coll': { 
            $in: ['users', 'orders', 'products', 'reviews'] 
          }
        }
      ],

      computedFields: {
        collectionType: '$ns.coll',
        businessImpact: {
          $switch: {
            branches: [
              { case: { $eq: ['$ns.coll', 'orders'] }, then: 'high' },
              { case: { $eq: ['$ns.coll', 'users'] }, then: 'medium' },
              { case: { $eq: ['$ns.coll', 'products'] }, then: 'medium' },
              { case: { $eq: ['$ns.coll', 'reviews'] }, then: 'low' }
            ],
            default: 'unknown'
          }
        }
      },

      eventHandlers: {
        insert: async (change, context) => {
          await this.handleMultiCollectionInsert(change);
        },
        update: async (change, context) => {
          await this.handleMultiCollectionUpdate(change);
        },
        delete: async (change, context) => {
          await this.handleMultiCollectionDelete(change);
        }
      }
    });
  }

  async handleMultiCollectionInsert(change) {
    const collection = change.ns.coll;

    switch (collection) {
      case 'users':
        await this.handleNewUser(change.fullDocument);
        break;
      case 'orders':
        await this.handleNewOrder(change);
        break;
      case 'products':
        await this.handleNewProduct(change.fullDocument);
        break;
      case 'reviews':
        await this.handleNewReview(change.fullDocument);
        break;
    }

    // Cross-collection analytics
    await this.updateCrossCollectionMetrics(collection, 'insert');
  }

  async setupAggregationUpdateStream() {
    // Monitor changes that require aggregation updates
    return await this.csm.createChangeStream({
      streamId: 'aggregation_updates',
      operationTypes: ['insert', 'update', 'delete'],

      filters: [
        {
          $or: [
            // Order changes affecting customer metrics
            { 
              $and: [
                { 'ns.coll': 'orders' },
                { 'fullDocument.status': 'completed' }
              ]
            },
            // Review changes affecting product ratings
            { 'ns.coll': 'reviews' },
            // Activity changes affecting user engagement
            { 
              $and: [
                { 'ns.coll': 'user_activities' },
                { 'fullDocument.activity_type': { $in: ['purchase', 'view', 'like'] } }
              ]
            }
          ]
        }
      ],

      eventHandlers: {
        default: async (change, context) => {
          await this.handleAggregationUpdate(change);
        }
      }
    });
  }

  async handleAggregationUpdate(change) {
    const collection = change.ns.coll;
    const document = change.fullDocument;

    switch (collection) {
      case 'orders':
        if (document.status === 'completed') {
          await this.updateCustomerMetrics(document.customer_id);
          await this.updateProductSalesMetrics(document.items);
        }
        break;

      case 'reviews':
        await this.updateProductRatings(document.product_id);
        break;

      case 'user_activities':
        await this.updateUserEngagementMetrics(document.user_id);
        break;
    }
  }

  // Analytics and Metrics Updates
  async updateUserEngagement(activity, user) {
    // Update real-time user engagement metrics
    const engagementUpdate = {
      $inc: {
        'metrics.total_activities': 1,
        [`metrics.activity_counts.${activity.activity_type}`]: 1
      },
      $set: {
        'metrics.last_activity': activity.created_at,
        'metrics.updated_at': new Date()
      }
    };

    await this.csm.db.collection('user_engagement').updateOne(
      { user_id: activity.user_id },
      engagementUpdate,
      { upsert: true }
    );
  }

  async updateCustomerMetrics(customerId) {
    // Recalculate customer lifetime value and order metrics
    const pipeline = [
      { $match: { customer_id: customerId, status: 'completed' } },
      {
        $group: {
          _id: '$customer_id',
          totalOrders: { $sum: 1 },
          totalSpent: { $sum: '$total_amount' },
          avgOrderValue: { $avg: '$total_amount' },
          lastOrderDate: { $max: '$created_at' },
          firstOrderDate: { $min: '$created_at' }
        }
      }
    ];

    const result = await this.csm.db.collection('orders').aggregate(pipeline).toArray();

    if (result.length > 0) {
      const metrics = result[0];
      await this.csm.db.collection('customer_metrics').updateOne(
        { customer_id: customerId },
        {
          $set: {
            ...metrics,
            updated_at: new Date()
          }
        },
        { upsert: true }
      );
    }
  }

  // Event Bus Integration
  setupEventBusHandlers() {
    this.eventBus.on('user_activity', (data) => {
      // Emit to external systems (WebSocket, message queue, etc.)
      this.emitToExternalSystems('user_activity', data);
    });

    this.eventBus.on('new_order', (data) => {
      this.emitToExternalSystems('new_order', data);
    });

    this.eventBus.on('inventory_change', (data) => {
      this.emitToExternalSystems('inventory_change', data);
    });
  }

  async emitToExternalSystems(eventType, data) {
    // WebSocket broadcasting
    if (this.wsServer) {
      this.wsServer.broadcast(JSON.stringify({
        type: eventType,
        data: data,
        timestamp: new Date()
      }));
    }

    // Message queue publishing
    if (this.messageQueue) {
      await this.messageQueue.publish(eventType, data);
    }

    // Webhook notifications
    if (this.webhookHandler) {
      await this.webhookHandler.notify(eventType, data);
    }
  }

  async shutdown() {
    console.log('Shutting down real-time event patterns...');
    this.eventBus.removeAllListeners();
    await this.csm.shutdown();
  }
}

SQL-Style Change Stream Operations with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB Change Stream configuration and monitoring:

-- QueryLeaf change stream operations with SQL-familiar syntax

-- Create change stream with advanced filtering
CREATE CHANGE_STREAM user_activities_stream ON user_activities
WITH (
  operations = ARRAY['insert', 'update'],
  resume_token_storage = 'mongodb',
  batch_size = 100,
  max_await_time_ms = 1000
)
FILTER (
  activity_type IN ('login', 'purchase', 'view', 'search') AND
  user_id IS NOT NULL AND
  created_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
)
ENRICH WITH (
  users ON user_activities.user_id = users._id AS user_data,
  user_sessions ON user_activities.session_id = user_sessions._id AS session_data
)
COMPUTE (
  activity_score = CASE 
    WHEN activity_type = 'purchase' THEN 100
    WHEN activity_type = 'login' THEN 10
    WHEN activity_type = 'search' THEN 5
    WHEN activity_type = 'view' THEN 1
    ELSE 0
  END,
  user_segment = user_data.segment,
  session_duration = session_data.duration
);

-- Monitor change stream with real-time processing
SELECT 
  change_id,
  operation_type,
  collection_name,
  document_key,
  cluster_time,

  -- Document data
  full_document,
  update_description,

  -- Computed fields from stream
  activity_score,
  user_segment,
  session_duration,

  -- Change categorization
  CASE 
    WHEN operation_type = 'insert' THEN 'new_activity'
    WHEN operation_type = 'update' AND update_description.updated_fields ? 'status' THEN 'status_change'
    WHEN operation_type = 'update' THEN 'activity_updated'
    ELSE 'other'
  END as change_category,

  -- Priority assessment
  CASE
    WHEN activity_score >= 50 THEN 'high'
    WHEN activity_score >= 10 THEN 'medium'
    ELSE 'low'
  END as priority_level,

  processed_at

FROM CHANGE_STREAM('user_activities_stream')
WHERE activity_score > 0
ORDER BY activity_score DESC, cluster_time ASC;

-- Multi-collection change stream monitoring
CREATE CHANGE_STREAM business_events_stream
WITH (
  operations = ARRAY['insert', 'update', 'delete'],
  full_document = 'updateLookup',
  full_document_before_change = 'whenAvailable'
)
FILTER (
  collection_name IN ('orders', 'users', 'products', 'inventory') AND
  (
    -- High-impact order changes
    (collection_name = 'orders' AND operation_type IN ('insert', 'update')) OR
    -- User registration and profile updates
    (collection_name = 'users' AND (operation_type = 'insert' OR update_description.updated_fields ? 'subscription_type')) OR
    -- Product catalog changes
    (collection_name = 'products' AND update_description.updated_fields ? 'price') OR
    -- Inventory level changes
    (collection_name = 'inventory' AND update_description.updated_fields ? 'available_quantity')
  )
);

-- Real-time analytics from change streams
WITH change_stream_analytics AS (
  SELECT 
    collection_name,
    operation_type,
    DATE_TRUNC('minute', cluster_time) as time_bucket,

    -- Event counts
    COUNT(*) as event_count,
    COUNT(*) FILTER (WHERE operation_type = 'insert') as inserts,
    COUNT(*) FILTER (WHERE operation_type = 'update') as updates,
    COUNT(*) FILTER (WHERE operation_type = 'delete') as deletes,

    -- Business metrics
    CASE collection_name
      WHEN 'orders' THEN 
        SUM(CASE WHEN operation_type = 'insert' THEN (full_document->>'total_amount')::numeric ELSE 0 END)
      ELSE 0
    END as revenue_impact,

    CASE collection_name
      WHEN 'inventory' THEN
        SUM(CASE 
          WHEN update_description.updated_fields ? 'available_quantity' 
          THEN (full_document->>'available_quantity')::int - (update_description.updated_fields->>'available_quantity')::int
          ELSE 0
        END)
      ELSE 0  
    END as inventory_change,

    -- Processing performance
    AVG(EXTRACT(EPOCH FROM (processed_at - cluster_time))) as avg_processing_latency_seconds,
    MAX(EXTRACT(EPOCH FROM (processed_at - cluster_time))) as max_processing_latency_seconds

  FROM CHANGE_STREAM('business_events_stream')
  WHERE cluster_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY collection_name, operation_type, DATE_TRUNC('minute', cluster_time)
),

real_time_dashboard AS (
  SELECT 
    time_bucket,

    -- Overall activity metrics
    SUM(event_count) as total_events,
    SUM(inserts) as total_inserts,
    SUM(updates) as total_updates,
    SUM(deletes) as total_deletes,

    -- Business KPIs
    SUM(revenue_impact) as minute_revenue,
    SUM(inventory_change) as net_inventory_change,

    -- Performance metrics
    AVG(avg_processing_latency_seconds) as avg_latency,
    MAX(max_processing_latency_seconds) as max_latency,

    -- Collection breakdown
    json_object_agg(
      collection_name,
      json_build_object(
        'events', event_count,
        'inserts', inserts,
        'updates', updates,
        'deletes', deletes
      )
    ) as collection_breakdown,

    -- Alerts and anomalies
    CASE 
      WHEN SUM(event_count) > 1000 THEN 'high_volume'
      WHEN AVG(avg_processing_latency_seconds) > 5 THEN 'high_latency'
      WHEN SUM(revenue_impact) < 0 THEN 'revenue_concern'
      ELSE 'normal'
    END as alert_status

  FROM change_stream_analytics
  GROUP BY time_bucket
)

SELECT 
  time_bucket,
  total_events,
  total_inserts,
  total_updates,
  total_deletes,
  ROUND(minute_revenue, 2) as revenue_per_minute,
  net_inventory_change,
  ROUND(avg_latency, 3) as avg_processing_seconds,
  ROUND(max_latency, 3) as max_processing_seconds,
  collection_breakdown,
  alert_status,

  -- Trend indicators
  LAG(total_events, 1) OVER (ORDER BY time_bucket) as prev_minute_events,
  ROUND(
    (total_events - LAG(total_events, 1) OVER (ORDER BY time_bucket))::numeric / 
    NULLIF(LAG(total_events, 1) OVER (ORDER BY time_bucket), 0) * 100,
    1
  ) as event_growth_pct,

  ROUND(
    (minute_revenue - LAG(minute_revenue, 1) OVER (ORDER BY time_bucket))::numeric / 
    NULLIF(LAG(minute_revenue, 1) OVER (ORDER BY time_bucket), 0) * 100,
    1
  ) as revenue_growth_pct

FROM real_time_dashboard
ORDER BY time_bucket DESC
LIMIT 60; -- Last hour of minute-by-minute data

-- Change stream error handling and monitoring
SELECT 
  stream_name,
  stream_status,
  created_at,
  last_event_at,
  event_count,
  error_count,
  retry_count,

  -- Health assessment
  CASE 
    WHEN error_count::float / NULLIF(event_count, 0) > 0.1 THEN 'UNHEALTHY'
    WHEN error_count::float / NULLIF(event_count, 0) > 0.05 THEN 'WARNING'  
    WHEN last_event_at < CURRENT_TIMESTAMP - INTERVAL '1 hour' THEN 'INACTIVE'
    ELSE 'HEALTHY'
  END as health_status,

  -- Performance metrics
  ROUND(error_count::numeric / NULLIF(event_count, 0) * 100, 2) as error_rate_pct,
  EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - last_event_at)) / 60 as minutes_since_last_event,

  -- Resume token status
  CASE 
    WHEN resume_token IS NOT NULL THEN 'RESUMABLE'
    ELSE 'NOT_RESUMABLE'
  END as resume_status,

  -- Recommendations
  CASE 
    WHEN error_count::float / NULLIF(event_count, 0) > 0.1 THEN 'Investigate error patterns and processing logic'
    WHEN retry_count > 5 THEN 'Check connection stability and resource limits'
    WHEN last_event_at < CURRENT_TIMESTAMP - INTERVAL '2 hours' THEN 'Verify data source and stream configuration'
    ELSE 'Stream operating normally'
  END as recommendation

FROM CHANGE_STREAM_STATUS()
ORDER BY 
  CASE health_status
    WHEN 'UNHEALTHY' THEN 1
    WHEN 'WARNING' THEN 2
    WHEN 'INACTIVE' THEN 3
    ELSE 4
  END,
  error_rate_pct DESC NULLS LAST;

-- Event-driven workflow triggers
CREATE TRIGGER real_time_order_processing
ON CHANGE_STREAM('business_events_stream')
WHEN (
  collection_name = 'orders' AND 
  operation_type = 'insert' AND
  full_document->>'status' = 'pending'
)
EXECUTE PROCEDURE (
  -- Inventory allocation
  UPDATE inventory 
  SET reserved_quantity = reserved_quantity + (
    SELECT SUM((item->>'quantity')::int)
    FROM json_array_elements(NEW.full_document->'items') AS item
    WHERE inventory.product_id = (item->>'product_id')::uuid
  ),
  available_quantity = available_quantity - (
    SELECT SUM((item->>'quantity')::int) 
    FROM json_array_elements(NEW.full_document->'items') AS item
    WHERE inventory.product_id = (item->>'product_id')::uuid
  )
  WHERE product_id IN (
    SELECT DISTINCT (item->>'product_id')::uuid
    FROM json_array_elements(NEW.full_document->'items') AS item
  );

  -- Payment processing trigger
  INSERT INTO payment_processing_queue (
    order_id,
    customer_id,
    amount,
    payment_method,
    priority,
    created_at
  )
  VALUES (
    (NEW.full_document->>'_id')::uuid,
    (NEW.full_document->>'customer_id')::uuid,
    (NEW.full_document->>'total_amount')::numeric,
    NEW.full_document->>'payment_method',
    CASE 
      WHEN (NEW.full_document->>'total_amount')::numeric > 1000 THEN 'high'
      ELSE 'normal'
    END,
    CURRENT_TIMESTAMP
  );

  -- Customer notification
  INSERT INTO notification_queue (
    recipient_id,
    notification_type,
    channel,
    message_data,
    created_at
  )
  VALUES (
    (NEW.full_document->>'customer_id')::uuid,
    'order_confirmation',
    'email',
    json_build_object(
      'order_id', NEW.full_document->>'_id',
      'order_total', NEW.full_document->>'total_amount',
      'items_count', json_array_length(NEW.full_document->'items')
    ),
    CURRENT_TIMESTAMP
  );
);

-- Change stream performance optimization
WITH stream_performance AS (
  SELECT 
    stream_name,
    AVG(processing_time_ms) as avg_processing_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY processing_time_ms) as p95_processing_time,
    MAX(processing_time_ms) as max_processing_time,
    COUNT(*) as total_events,
    SUM(CASE WHEN processing_time_ms > 1000 THEN 1 ELSE 0 END) as slow_events,
    AVG(batch_size) as avg_batch_size
  FROM CHANGE_STREAM_METRICS()
  WHERE recorded_at >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  GROUP BY stream_name
)
SELECT 
  stream_name,
  ROUND(avg_processing_time, 2) as avg_processing_ms,
  ROUND(p95_processing_time, 2) as p95_processing_ms,
  max_processing_time as max_processing_ms,
  total_events,
  ROUND((slow_events::numeric / total_events) * 100, 2) as slow_event_pct,
  ROUND(avg_batch_size, 1) as avg_batch_size,

  -- Performance assessment
  CASE 
    WHEN avg_processing_time > 2000 THEN 'SLOW'
    WHEN slow_events::numeric / total_events > 0.1 THEN 'INCONSISTENT'  
    WHEN avg_batch_size < 10 THEN 'UNDERUTILIZED'
    ELSE 'OPTIMAL'
  END as performance_status,

  -- Optimization recommendations
  CASE
    WHEN avg_processing_time > 2000 THEN 'Optimize event processing logic and reduce complexity'
    WHEN slow_events::numeric / total_events > 0.1 THEN 'Investigate processing bottlenecks and resource constraints'
    WHEN avg_batch_size < 10 THEN 'Increase batch size for better throughput'
    WHEN p95_processing_time > 5000 THEN 'Add error handling and timeout management'
    ELSE 'Performance is within acceptable limits'
  END as optimization_recommendation

FROM stream_performance
ORDER BY avg_processing_time DESC;

-- QueryLeaf provides comprehensive change stream capabilities:
-- 1. SQL-familiar change stream creation and configuration
-- 2. Advanced filtering with complex business logic
-- 3. Real-time enrichment with related collection data
-- 4. Computed fields for event categorization and scoring
-- 5. Multi-collection monitoring with unified interface
-- 6. Real-time analytics and dashboard integration
-- 7. Event-driven workflow automation and triggers
-- 8. Performance monitoring and optimization recommendations
-- 9. Error handling and automatic retry mechanisms
-- 10. Resume capability for fault-tolerant processing

Best Practices for Change Stream Implementation

Design Guidelines

Essential practices for optimal change stream configuration:

Strategic Filtering: Design filters to process only relevant changes and minimize resource usage
Resume Strategy: Implement robust resume token storage for fault-tolerant processing
Error Handling: Build comprehensive error handling with retry strategies and dead letter queues
Performance Monitoring: Track processing latency, throughput, and error rates continuously
Resource Management: Size change stream configurations based on expected data volumes
Event Ordering: Understand and leverage MongoDB's ordering guarantees within and across collections

Scalability and Performance

Optimize change streams for high-throughput, low-latency processing:

Batch Processing: Configure appropriate batch sizes for optimal throughput
Parallel Processing: Distribute change processing across multiple consumers when possible
Resource Allocation: Ensure adequate compute and network resources for real-time processing
Connection Management: Use connection pooling and proper resource cleanup
Monitoring Integration: Integrate with observability tools for production monitoring
Load Testing: Test change stream performance under expected and peak loads

Conclusion

MongoDB Change Streams provide enterprise-grade real-time data processing capabilities that eliminate the complexity and overhead of polling-based change detection while delivering immediate, ordered, and resumable event notifications. The integration of sophisticated filtering, enrichment, and processing capabilities makes building reactive applications and event-driven architectures both powerful and maintainable.

Key Change Streams benefits include:

Real-Time Processing: Sub-second latency for immediate response to data changes
Complete Change Context: Full document state and change details for comprehensive processing
Fault Tolerance: Automatic resume capability and robust error handling mechanisms
Scalable Architecture: Support for high-throughput processing across sharded clusters
Developer Experience: Intuitive API with powerful aggregation pipeline integration
Production Ready: Built-in monitoring, authentication, and operational capabilities

Whether you're building live dashboards, automated workflows, real-time analytics, or event-driven microservices, MongoDB Change Streams with QueryLeaf's familiar SQL interface provides the foundation for reactive data processing. This combination enables you to implement sophisticated real-time capabilities while preserving familiar development patterns and operational approaches.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB Change Stream operations while providing SQL-familiar change detection, event filtering, and real-time processing syntax. Advanced stream configuration, error handling, and performance optimization are seamlessly handled through familiar SQL patterns, making real-time data processing both powerful and accessible.

The integration of native change stream capabilities with SQL-style operations makes MongoDB an ideal platform for applications requiring both real-time responsiveness and familiar database interaction patterns, ensuring your event-driven architecture remains both effective and maintainable as it scales and evolves.

September 15, 2025
27 min read

MongoDB Data Modeling and Schema Design Patterns: SQL-Style Database Design for NoSQL Performance and Flexibility

Modern applications require database designs that can handle complex data relationships, evolving requirements, and massive scale while maintaining query performance and data consistency. Traditional relational database design relies on normalization principles and rigid schema constraints, but often struggles with nested data structures, dynamic attributes, and horizontal scaling demands that characterize modern applications.

MongoDB's document-based data model provides flexible schema design that can adapt to changing requirements while delivering high performance through strategic denormalization and document structure optimization. Unlike relational databases that require complex joins to reassemble related data, MongoDB document modeling can embed related data within single documents, reducing query complexity and improving performance for read-heavy workloads.

The Relational Database Design Challenge

Traditional relational database design approaches face significant limitations with modern application requirements:

-- Traditional relational database design - rigid and join-heavy
-- E-commerce product catalog with complex relationships

CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    category_name VARCHAR(100) NOT NULL,
    parent_category_id INTEGER REFERENCES categories(category_id),
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE brands (
    brand_id SERIAL PRIMARY KEY,
    brand_name VARCHAR(100) NOT NULL UNIQUE,
    brand_description TEXT,
    brand_website VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(255) NOT NULL,
    product_description TEXT,
    category_id INTEGER NOT NULL REFERENCES categories(category_id),
    brand_id INTEGER NOT NULL REFERENCES brands(brand_id),
    base_price DECIMAL(10, 2) NOT NULL,
    weight DECIMAL(8, 3),
    dimensions_length DECIMAL(8, 2),
    dimensions_width DECIMAL(8, 2), 
    dimensions_height DECIMAL(8, 2),
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE product_attributes (
    attribute_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    attribute_name VARCHAR(100) NOT NULL,
    attribute_value TEXT NOT NULL,
    attribute_type VARCHAR(50) DEFAULT 'string',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    UNIQUE(product_id, attribute_name)
);

CREATE TABLE product_images (
    image_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    image_url VARCHAR(500) NOT NULL,
    image_alt_text VARCHAR(255),
    display_order INTEGER DEFAULT 0,
    is_primary BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE product_variants (
    variant_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    variant_name VARCHAR(255) NOT NULL,
    sku VARCHAR(100) UNIQUE,
    price_adjustment DECIMAL(10, 2) DEFAULT 0,
    stock_quantity INTEGER DEFAULT 0,
    variant_attributes JSONB,
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE product_reviews (
    review_id SERIAL PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    rating INTEGER CHECK (rating >= 1 AND rating <= 5),
    review_title VARCHAR(200),
    review_text TEXT,
    is_verified_purchase BOOLEAN DEFAULT false,
    helpful_votes INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Complex query to get product details with all related data
SELECT 
    p.product_id,
    p.product_name,
    p.product_description,
    p.base_price,

    -- Category hierarchy (requires recursive CTE for full path)
    c.category_name,
    parent_c.category_name as parent_category,

    -- Brand information
    b.brand_name,
    b.brand_description,

    -- Product dimensions
    CASE 
        WHEN p.dimensions_length IS NOT NULL THEN 
            CONCAT(p.dimensions_length, ' x ', p.dimensions_width, ' x ', p.dimensions_height)
        ELSE NULL
    END as dimensions,

    -- Aggregate attributes (problematic with large numbers)
    STRING_AGG(
        CONCAT(pa.attribute_name, ': ', pa.attribute_value), 
        ', ' 
        ORDER BY pa.attribute_name
    ) as attributes,

    -- Primary image
    pi_primary.image_url as primary_image,

    -- Review statistics
    COUNT(DISTINCT pr.review_id) as review_count,
    ROUND(AVG(pr.rating), 2) as average_rating,

    -- Variant count
    COUNT(DISTINCT pv.variant_id) as variant_count,

    -- Stock availability across variants
    SUM(pv.stock_quantity) as total_stock

FROM products p
JOIN categories c ON p.category_id = c.category_id
LEFT JOIN categories parent_c ON c.parent_category_id = parent_c.category_id
JOIN brands b ON p.brand_id = b.brand_id
LEFT JOIN product_attributes pa ON p.product_id = pa.product_id
LEFT JOIN product_images pi_primary ON p.product_id = pi_primary.product_id 
    AND pi_primary.is_primary = true
LEFT JOIN product_variants pv ON p.product_id = pv.product_id 
    AND pv.is_active = true
LEFT JOIN product_reviews pr ON p.product_id = pr.product_id

WHERE p.is_active = true
    AND p.product_id = $1

GROUP BY 
    p.product_id, p.product_name, p.product_description, p.base_price,
    c.category_name, parent_c.category_name,
    b.brand_name, b.brand_description,
    p.dimensions_length, p.dimensions_width, p.dimensions_height,
    pi_primary.image_url;

-- Problems with relational approach:
-- 1. Complex multi-table joins for simple product queries
-- 2. Difficult to add new product attributes without schema changes
-- 3. Poor performance with large numbers of attributes and images
-- 4. Rigid schema prevents storing varying product structures
-- 5. N+1 query problems when loading product catalogs
-- 6. Difficult to handle hierarchical categories efficiently
-- 7. Complex aggregation queries for review statistics
-- 8. Schema migrations required for new product types
-- 9. Inefficient storage of sparse attributes
-- 10. Challenging to implement full-text search across attributes

MongoDB's document-based design eliminates many of these issues:

// MongoDB optimized document design - flexible and performance-oriented
// Single document contains all product information

// Example product document with embedded data
const productDocument = {
  _id: ObjectId("64a1b2c3d4e5f6789012345a"),

  // Basic product information
  name: "MacBook Pro 16-inch M3 Max",
  description: "Powerful laptop for professional workflows with M3 Max chip, stunning Liquid Retina XDR display, and all-day battery life.",
  sku: "MACBOOK-PRO-16-M3MAX-512GB",

  // Category with embedded hierarchy
  category: {
    primary: "Electronics",
    secondary: "Computers & Tablets", 
    tertiary: "Laptops",
    path: ["Electronics", "Computers & Tablets", "Laptops"],
    categoryId: "electronics-computers-laptops"
  },

  // Brand information embedded
  brand: {
    name: "Apple",
    description: "Innovative technology products and solutions",
    website: "https://www.apple.com",
    brandId: "apple"
  },

  // Pricing structure
  pricing: {
    basePrice: 3499.00,
    currency: "USD",
    priceHistory: [
      { price: 3499.00, effectiveDate: ISODate("2024-01-15"), reason: "launch_price" },
      { price: 3299.00, effectiveDate: ISODate("2024-06-01"), reason: "promotional_discount" }
    ],
    currentPrice: 3299.00,
    msrp: 3499.00
  },

  // Physical specifications
  specifications: {
    dimensions: {
      length: 35.57,
      width: 24.81,
      height: 1.68,
      unit: "cm"
    },
    weight: {
      value: 2.16,
      unit: "kg"
    },

    // Technical specifications as flexible object
    technical: {
      processor: "Apple M3 Max chip with 12-core CPU and 38-core GPU",
      memory: "36GB unified memory",
      storage: "512GB SSD storage",
      display: {
        size: "16.2-inch",
        resolution: "3456 x 2234",
        technology: "Liquid Retina XDR",
        brightness: "1000 nits sustained, 1600 nits peak"
      },
      connectivity: [
        "Three Thunderbolt 4 ports",
        "HDMI port", 
        "SDXC card slot",
        "MagSafe 3 charging port",
        "3.5mm headphone jack"
      ],
      wireless: {
        wifi: "Wi-Fi 6E",
        bluetooth: "Bluetooth 5.3"
      },
      operatingSystem: "macOS Sonoma"
    }
  },

  // Flexible attributes array for varying product features
  attributes: [
    { name: "Color", value: "Space Black", type: "string", searchable: true },
    { name: "Screen Size", value: 16.2, type: "number", unit: "inches" },
    { name: "Battery Life", value: "Up to 22 hours", type: "string" },
    { name: "Warranty", value: "1 Year Limited", type: "string" },
    { name: "Touch ID", value: true, type: "boolean" }
  ],

  // Images embedded for faster loading
  images: [
    {
      url: "https://images.example.com/macbook-pro-16-space-black-1.jpg",
      altText: "MacBook Pro 16-inch in Space Black - front view",
      isPrimary: true,
      displayOrder: 1,
      imageType: "product_shot",
      dimensions: { width: 2000, height: 1500 }
    },
    {
      url: "https://images.example.com/macbook-pro-16-space-black-2.jpg", 
      altText: "MacBook Pro 16-inch in Space Black - side view",
      isPrimary: false,
      displayOrder: 2,
      imageType: "product_shot",
      dimensions: { width: 2000, height: 1500 }
    }
  ],

  // Product variants embedded for related configurations
  variants: [
    {
      _id: ObjectId("64a1b2c3d4e5f6789012345b"),
      name: "MacBook Pro 16-inch M3 Max - 1TB",
      sku: "MACBOOK-PRO-16-M3MAX-1TB",
      priceAdjustment: 500.00,
      specifications: {
        storage: "1TB SSD storage",
        memory: "36GB unified memory"
      },
      stockQuantity: 45,
      isActive: true,
      attributes: [
        { name: "Storage", value: "1TB", type: "string" }
      ]
    },
    {
      _id: ObjectId("64a1b2c3d4e5f6789012345c"),
      name: "MacBook Pro 16-inch M3 Max - Silver",
      sku: "MACBOOK-PRO-16-M3MAX-SILVER",
      priceAdjustment: 0.00,
      attributes: [
        { name: "Color", value: "Silver", type: "string" }
      ],
      stockQuantity: 23,
      isActive: true
    }
  ],

  // Inventory and availability
  inventory: {
    stockQuantity: 67,
    reservedQuantity: 3,
    availableQuantity: 64,
    reorderLevel: 10,
    reorderQuantity: 50,
    lastRestocked: ISODate("2024-09-01"),
    supplier: {
      name: "Apple Inc.",
      supplierId: "APPLE_DIRECT",
      leadTimeDays: 7
    }
  },

  // Reviews embedded with summary statistics
  reviews: {
    // Summary statistics for quick access
    summary: {
      totalReviews: 347,
      averageRating: 4.7,
      ratingDistribution: {
        "5": 245,
        "4": 78, 
        "3": 18,
        "2": 4,
        "1": 2
      },
      lastUpdated: ISODate("2024-09-14")
    },

    // Recent reviews embedded (with pagination for full list)
    recent: [
      {
        _id: ObjectId("64a1b2c3d4e5f6789012346a"),
        customerId: ObjectId("64a1b2c3d4e5f678901234aa"),
        customerName: "Sarah Chen",
        rating: 5,
        title: "Exceptional performance for video editing",
        text: "The M3 Max chip handles 4K video editing effortlessly. Battery life is impressive for such a powerful machine.",
        isVerifiedPurchase: true,
        helpfulVotes: 23,
        createdAt: ISODate("2024-09-10"),
        updatedAt: ISODate("2024-09-10")
      }
    ]
  },

  // SEO and search optimization
  seo: {
    metaTitle: "MacBook Pro 16-inch M3 Max - Professional Performance",
    metaDescription: "Experience unmatched performance with the MacBook Pro featuring M3 Max chip, 36GB memory, and stunning 16-inch Liquid Retina XDR display.",
    keywords: ["MacBook Pro", "M3 Max", "16-inch", "laptop", "Apple", "professional"],
    searchTerms: [
      "macbook pro 16 inch",
      "apple laptop", 
      "m3 max",
      "professional laptop",
      "video editing laptop"
    ]
  },

  // Status and metadata
  status: {
    isActive: true,
    isPublished: true,
    isFeatured: true,
    publishedAt: ISODate("2024-01-15"),
    lastModified: ISODate("2024-09-14"),
    version: 3
  },

  // Analytics and performance tracking
  analytics: {
    views: {
      total: 15420,
      thisMonth: 2341,
      uniqueVisitors: 12087
    },
    conversions: {
      addToCart: 892,
      purchases: 156,
      conversionRate: 17.5
    },
    searchPerformance: {
      avgPosition: 2.3,
      clickThroughRate: 8.7,
      impressions: 45230
    }
  },

  // Timestamps for auditing and tracking
  createdAt: ISODate("2024-01-15"),
  updatedAt: ISODate("2024-09-14")
};

// Benefits of MongoDB document design:
// - Single query retrieves complete product information
// - Flexible schema accommodates different product types
// - Embedded related data eliminates joins
// - Rich nested structures for complex specifications
// - Easy to add new attributes without schema changes
// - Efficient storage and retrieval of product hierarchies
// - Native support for arrays and nested objects
// - Simplified application logic with document-oriented design
// - Better performance for product catalog queries
// - Natural fit for JSON-based APIs and front-end applications

Understanding MongoDB Data Modeling Patterns

Document Structure and Embedding Strategies

Strategic document design patterns for optimal performance and maintainability:

// Advanced MongoDB data modeling patterns for different use cases
class MongoDataModelingPatterns {
  constructor(db) {
    this.db = db;
    this.modelingPatterns = new Map();
  }

  // Pattern 1: Embedded Document Pattern
  // Use when: Related data is accessed together, 1:1 or 1:few relationships
  createUserProfileEmbeddedPattern() {
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Basic user information
      username: "sarah_dev",
      email: "sarah@example.com",

      // Embedded profile information (1:1 relationship)
      profile: {
        firstName: "Sarah",
        lastName: "Johnson",
        dateOfBirth: ISODate("1990-05-15"),
        avatar: {
          url: "https://images.example.com/avatars/sarah_dev.jpg",
          uploadedAt: ISODate("2024-03-12"),
          size: { width: 200, height: 200 }
        },
        bio: "Full-stack developer passionate about clean code and user experience",
        location: {
          city: "San Francisco",
          state: "CA",
          country: "USA",
          timezone: "America/Los_Angeles"
        },
        socialMedia: {
          github: "https://github.com/sarahdev",
          linkedin: "https://linkedin.com/in/sarah-johnson-dev",
          twitter: "@sarah_codes"
        }
      },

      // Embedded preferences (1:1 relationship)
      preferences: {
        theme: "dark",
        language: "en",
        notifications: {
          email: true,
          push: false,
          sms: false
        },
        privacy: {
          profileVisibility: "public",
          showEmail: false,
          showLocation: true
        }
      },

      // Embedded contact methods (1:few relationship)  
      contactMethods: [
        {
          type: "email",
          value: "sarah@example.com", 
          isPrimary: true,
          isVerified: true,
          verifiedAt: ISODate("2024-01-15")
        },
        {
          type: "phone",
          value: "+1-555-123-4567",
          isPrimary: false,
          isVerified: true,
          verifiedAt: ISODate("2024-01-20")
        }
      ],

      // Embedded skills (1:many but limited)
      skills: [
        { name: "JavaScript", level: "expert", yearsExperience: 8 },
        { name: "Python", level: "advanced", yearsExperience: 5 },
        { name: "MongoDB", level: "intermediate", yearsExperience: 3 },
        { name: "React", level: "expert", yearsExperience: 6 }
      ],

      // Account status and metadata
      account: {
        status: "active",
        type: "premium",
        createdAt: ISODate("2024-01-15"),
        lastLoginAt: ISODate("2024-09-14"),
        loginCount: 342,
        isEmailVerified: true,
        twoFactorEnabled: true
      },

      createdAt: ISODate("2024-01-15"),
      updatedAt: ISODate("2024-09-14")
    };
  }

  // Pattern 2: Reference Pattern  
  // Use when: Large documents, many:many relationships, frequently changing data
  createBlogPostReferencePattern() {
    // Main blog post document
    const blogPost = {
      _id: ObjectId("64a1b2c3d4e5f6789012348a"),
      title: "Advanced MongoDB Data Modeling Techniques",
      slug: "advanced-mongodb-data-modeling-techniques",
      content: "Content of the blog post...",
      excerpt: "Learn advanced techniques for MongoDB data modeling...",

      // Reference to author (many posts : 1 author)
      authorId: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Reference to category (many posts : 1 category)
      categoryId: ObjectId("64a1b2c3d4e5f6789012349a"),

      // References to tags (many posts : many tags)
      tagIds: [
        ObjectId("64a1b2c3d4e5f67890123401"),
        ObjectId("64a1b2c3d4e5f67890123402"), 
        ObjectId("64a1b2c3d4e5f67890123403")
      ],

      // Post metadata
      metadata: {
        publishedAt: ISODate("2024-09-10"),
        status: "published",
        featuredImageUrl: "https://images.example.com/blog/mongodb-modeling.jpg",
        readingTime: 12,
        wordCount: 2400
      },

      // SEO information
      seo: {
        metaTitle: "Advanced MongoDB Data Modeling - Complete Guide",
        metaDescription: "Master MongoDB data modeling with patterns, best practices, and real-world examples.",
        keywords: ["MongoDB", "data modeling", "NoSQL", "database design"]
      },

      // Analytics data
      stats: {
        views: 2340,
        likes: 89,
        shares: 23,
        commentsCount: 15, // Computed field updated by triggers
        averageRating: 4.6
      },

      createdAt: ISODate("2024-09-08"),
      updatedAt: ISODate("2024-09-14")
    };

    // Separate comments collection for scalability
    const blogComments = [
      {
        _id: ObjectId("64a1b2c3d4e5f67890123501"),
        postId: ObjectId("64a1b2c3d4e5f6789012348a"), // Reference to blog post
        authorId: ObjectId("64a1b2c3d4e5f67890123470"), // Reference to user
        content: "Great article! Very helpful examples.",

        // Embedded author info for faster loading (denormalization)
        author: {
          username: "dev_mike",
          avatar: "https://images.example.com/avatars/dev_mike.jpg",
          displayName: "Mike Chen"
        },

        // Support for nested replies
        parentCommentId: null, // Top-level comment
        replyCount: 2,

        // Comment moderation
        status: "approved",
        moderatedBy: ObjectId("64a1b2c3d4e5f67890123500"),
        moderatedAt: ISODate("2024-09-11"),

        // Engagement metrics
        likes: 5,
        dislikes: 0,
        isReported: false,

        createdAt: ISODate("2024-09-11"),
        updatedAt: ISODate("2024-09-11")
      }
    ];

    return { blogPost, blogComments };
  }

  // Pattern 3: Hybrid Pattern (Embedding + Referencing)
  // Use when: Need benefits of both patterns for different aspects
  createOrderHybridPattern() {
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012350a"),
      orderNumber: "ORD-2024-091401",

      // Customer reference (frequent lookups, separate profile management)
      customerId: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Embedded customer snapshot for order history queries
      customerSnapshot: {
        name: "Sarah Johnson",
        email: "sarah@example.com",
        phone: "+1-555-123-4567",
        // Capture customer state at time of order
        membershipLevel: "gold",
        snapshotDate: ISODate("2024-09-14")
      },

      // Embedded order items (order-specific, not shared)
      items: [
        {
          productId: ObjectId("64a1b2c3d4e5f6789012345a"), // Reference for inventory updates

          // Embedded product snapshot to preserve order history
          productSnapshot: {
            name: "MacBook Pro 16-inch M3 Max",
            sku: "MACBOOK-PRO-16-M3MAX-512GB",
            description: "Powerful laptop for professional workflows...",
            image: "https://images.example.com/macbook-pro-16-1.jpg",
            // Capture product state at time of order
            snapshotDate: ISODate("2024-09-14")
          },

          quantity: 1,
          unitPrice: 3299.00,
          totalPrice: 3299.00,

          // Item-specific information
          selectedVariant: {
            color: "Space Black",
            storage: "512GB",
            variantId: ObjectId("64a1b2c3d4e5f6789012345b")
          },

          // Embedded pricing breakdown
          pricing: {
            basePrice: 3499.00,
            discount: 200.00,
            discountReason: "promotional_discount",
            finalPrice: 3299.00,
            tax: 263.92,
            taxRate: 8.0
          }
        }
      ],

      // Embedded shipping information
      shipping: {
        method: "express",
        carrier: "FedEx",
        trackingNumber: "1234567890123456",
        cost: 15.99,

        // Embedded shipping address (snapshot)
        address: {
          name: "Sarah Johnson",
          company: null,
          addressLine1: "123 Tech Street",
          addressLine2: "Apt 4B",
          city: "San Francisco",
          state: "CA",
          postalCode: "94107",
          country: "USA",
          phone: "+1-555-123-4567"
        },

        estimatedDelivery: ISODate("2024-09-16"),
        actualDelivery: null,
        deliveryInstructions: "Leave at door if not home"
      },

      // Embedded billing information
      billing: {
        // Reference to payment method for future use
        paymentMethodId: ObjectId("64a1b2c3d4e5f67890123600"),

        // Embedded payment snapshot
        paymentSnapshot: {
          method: "credit_card",
          last4: "4242",
          brand: "visa",
          expiryMonth: 12,
          expiryYear: 2027,
          // Capture payment method state at time of order
          snapshotDate: ISODate("2024-09-14")
        },

        // Billing address (may differ from shipping)
        address: {
          name: "Sarah Johnson",
          addressLine1: "456 Billing Ave",
          city: "San Francisco",
          state: "CA", 
          postalCode: "94107",
          country: "USA"
        },

        // Payment processing details
        transactionId: "txn_1234567890abcdef",
        processorResponse: "approved",
        authorizationCode: "AUTH123456",
        capturedAt: ISODate("2024-09-14")
      },

      // Order totals and calculations
      totals: {
        subtotal: 3299.00,
        taxAmount: 263.92,
        shippingAmount: 15.99,
        discountAmount: 200.00,
        totalAmount: 3378.91,
        currency: "USD"
      },

      // Order status and timeline
      status: {
        current: "processing",
        timeline: [
          {
            status: "placed",
            timestamp: ISODate("2024-09-14T10:30:00Z"),
            note: "Order successfully placed"
          },
          {
            status: "paid", 
            timestamp: ISODate("2024-09-14T10:30:15Z"),
            note: "Payment processed successfully"
          },
          {
            status: "processing",
            timestamp: ISODate("2024-09-14T11:15:00Z"),
            note: "Order sent to fulfillment center"
          }
        ]
      },

      // Order metadata
      metadata: {
        source: "web",
        campaign: "fall_promotion_2024",
        referrer: "google_ads",
        userAgent: "Mozilla/5.0...",
        ipAddress: "192.168.1.1",
        sessionId: "sess_abcd1234efgh5678"
      },

      createdAt: ISODate("2024-09-14T10:30:00Z"),
      updatedAt: ISODate("2024-09-14T11:15:00Z")
    };
  }

  // Pattern 4: Polymorphic Pattern
  // Use when: Similar documents have different structures based on type
  createNotificationPolymorphicPattern() {
    const notifications = [
      // Email notification type
      {
        _id: ObjectId("64a1b2c3d4e5f6789012351a"),
        type: "email",
        userId: ObjectId("64a1b2c3d4e5f6789012347a"),

        // Common notification fields
        title: "Welcome to our platform!",
        priority: "normal",
        status: "sent",
        createdAt: ISODate("2024-09-14T10:00:00Z"),

        // Email-specific fields
        emailData: {
          from: "noreply@example.com",
          to: "sarah@example.com",
          subject: "Welcome to our platform!",
          templateId: "welcome_email_v2",
          templateVariables: {
            firstName: "Sarah",
            activationLink: "https://example.com/activate/abc123"
          },
          deliveryAttempts: 1,
          deliveredAt: ISODate("2024-09-14T10:01:30Z"),
          openedAt: ISODate("2024-09-14T10:15:22Z"),
          clickedAt: ISODate("2024-09-14T10:16:10Z")
        }
      },

      // Push notification type
      {
        _id: ObjectId("64a1b2c3d4e5f6789012351b"),
        type: "push",
        userId: ObjectId("64a1b2c3d4e5f6789012347a"),

        // Common notification fields
        title: "Your order has shipped!",
        priority: "high",
        status: "delivered",
        createdAt: ISODate("2024-09-14T14:30:00Z"),

        // Push-specific fields
        pushData: {
          deviceTokens: [
            "device_token_1234567890abcdef",
            "device_token_abcdef1234567890"
          ],
          payload: {
            alert: {
              title: "Order Shipped",
              body: "Your MacBook Pro is on the way! Track: 1234567890123456"
            },
            badge: 1,
            sound: "default",
            category: "order_update",
            customData: {
              orderId: "ORD-2024-091401",
              trackingNumber: "1234567890123456",
              deepLink: "app://orders/ORD-2024-091401"
            }
          },
          deliveryResults: [
            {
              deviceToken: "device_token_1234567890abcdef",
              status: "delivered",
              deliveredAt: ISODate("2024-09-14T14:31:15Z")
            },
            {
              deviceToken: "device_token_abcdef1234567890", 
              status: "failed",
              error: "invalid_token",
              attemptedAt: ISODate("2024-09-14T14:31:15Z")
            }
          ]
        }
      },

      // SMS notification type
      {
        _id: ObjectId("64a1b2c3d4e5f6789012351c"),
        type: "sms",
        userId: ObjectId("64a1b2c3d4e5f6789012347a"),

        // Common notification fields
        title: "Security Alert",
        priority: "urgent",
        status: "sent",
        createdAt: ISODate("2024-09-14T16:45:00Z"),

        // SMS-specific fields
        smsData: {
          to: "+15551234567",
          from: "+15559876543",
          message: "Security Alert: New login detected from San Francisco, CA. If this wasn't you, secure your account immediately.",
          provider: "twilio",
          messageId: "SMabcdef1234567890",
          segments: 1,
          cost: 0.0075,
          deliveredAt: ISODate("2024-09-14T16:45:12Z"),
          deliveryStatus: "delivered"
        }
      }
    ];

    return notifications;
  }

  // Pattern 5: Bucket Pattern
  // Use when: Time-series data or high-volume data needs grouping
  createMetricsBucketPattern() {
    // Group metrics by hour to reduce document count
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012352a"),

      // Bucket identifier
      type: "user_activity_metrics",
      userId: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Time bucket information
      bucketDate: ISODate("2024-09-14T10:00:00Z"), // Hour bucket start
      bucketSize: "hourly",

      // Metadata for the bucket
      metadata: {
        userName: "sarah_dev",
        userSegment: "premium",
        deviceType: "desktop",
        location: "San Francisco, CA"
      },

      // Count of events in this bucket
      eventCount: 45,

      // Array of individual events within the time bucket
      events: [
        {
          timestamp: ISODate("2024-09-14T10:05:23Z"),
          eventType: "page_view",
          page: "/dashboard",
          sessionId: "sess_abc123",
          loadTime: 1250,
          userAgent: "Mozilla/5.0..."
        },
        {
          timestamp: ISODate("2024-09-14T10:07:45Z"),
          eventType: "click",
          element: "export_button",
          page: "/reports",
          sessionId: "sess_abc123"
        },
        {
          timestamp: ISODate("2024-09-14T10:12:10Z"),
          eventType: "api_call",
          endpoint: "/api/v1/reports/generate",
          responseTime: 2340,
          statusCode: 200,
          sessionId: "sess_abc123"
        }
        // ... more events up to reasonable bucket size (e.g., 100-1000 events)
      ],

      // Pre-aggregated summary statistics for the bucket
      summary: {
        pageViews: 15,
        clicks: 8,
        apiCalls: 12,
        errors: 2,
        uniquePages: 6,
        totalLoadTime: 18750,
        avgLoadTime: 1250,
        maxLoadTime: 3200,
        minLoadTime: 450,
        totalSessionTime: 1800000 // 30 minutes
      },

      // Bucket management
      bucketMetadata: {
        isFull: false,
        maxEvents: 1000,
        createdAt: ISODate("2024-09-14T10:05:23Z"),
        lastUpdated: ISODate("2024-09-14T10:59:45Z"),
        nextBucketId: null // Set when bucket is full
      }
    };
  }

  // Pattern 6: Attribute Pattern  
  // Use when: Documents have many similar fields or sparse attributes
  createProductAttributePattern() {
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012353a"),
      productName: "Gaming Desktop Computer",
      category: "Electronics",

      // Attribute pattern for flexible, searchable specifications
      attributes: [
        {
          key: "processor",
          value: "Intel Core i9-13900K",
          type: "string",
          unit: null,
          isSearchable: true,
          isFilterable: true,
          displayOrder: 1,
          category: "performance"
        },
        {
          key: "ram",
          value: 32,
          type: "number",
          unit: "GB",
          isSearchable: true,
          isFilterable: true,
          displayOrder: 2,
          category: "performance"
        },
        {
          key: "storage",
          value: "1TB NVMe SSD + 2TB HDD",
          type: "string", 
          unit: null,
          isSearchable: true,
          isFilterable: false,
          displayOrder: 3,
          category: "storage"
        },
        {
          key: "graphics_card",
          value: "NVIDIA GeForce RTX 4080",
          type: "string",
          unit: null,
          isSearchable: true,
          isFilterable: true,
          displayOrder: 4,
          category: "performance"
        },
        {
          key: "power_consumption",
          value: 750,
          type: "number",
          unit: "watts",
          isSearchable: false,
          isFilterable: true,
          displayOrder: 10,
          category: "specifications"
        },
        {
          key: "warranty_years",
          value: 3,
          type: "number", 
          unit: "years",
          isSearchable: false,
          isFilterable: true,
          displayOrder: 15,
          category: "warranty"
        },
        {
          key: "rgb_lighting",
          value: true,
          type: "boolean",
          unit: null,
          isSearchable: false,
          isFilterable: true,
          displayOrder: 20,
          category: "aesthetics"
        }
      ],

      // Pre-computed attribute indexes for faster queries
      attributeIndex: {
        // String attributes for text search
        stringAttributes: {
          "processor": "Intel Core i9-13900K",
          "storage": "1TB NVMe SSD + 2TB HDD",
          "graphics_card": "NVIDIA GeForce RTX 4080"
        },

        // Numeric attributes for range queries
        numericAttributes: {
          "ram": 32,
          "power_consumption": 750,
          "warranty_years": 3
        },

        // Boolean attributes for exact matching
        booleanAttributes: {
          "rgb_lighting": true
        },

        // Searchable attribute values for text search
        searchableValues: [
          "Intel Core i9-13900K",
          "1TB NVMe SSD + 2TB HDD", 
          "NVIDIA GeForce RTX 4080"
        ],

        // Filterable attributes for faceted search
        filterableAttributes: [
          "processor", "ram", "graphics_card", 
          "power_consumption", "warranty_years", "rgb_lighting"
        ]
      },

      createdAt: ISODate("2024-09-14"),
      updatedAt: ISODate("2024-09-14")
    };
  }

  // Pattern 7: Computed Pattern
  // Use when: Expensive calculations need to be pre-computed and stored
  createUserAnalyticsComputedPattern() {
    return {
      _id: ObjectId("64a1b2c3d4e5f6789012354a"),
      userId: ObjectId("64a1b2c3d4e5f6789012347a"),

      // Computed metrics updated periodically
      computedMetrics: {
        // User engagement metrics
        engagement: {
          totalSessions: 342,
          totalSessionTime: 45600000, // milliseconds
          avgSessionDuration: 133333, // milliseconds (4.5 minutes)
          lastActiveDate: ISODate("2024-09-14"),
          daysSinceLastActive: 0,

          // Activity patterns
          mostActiveHour: 14, // 2 PM
          mostActiveDay: "tuesday",
          peakActivityScore: 8.7,

          // Engagement trends (last 30 days)
          dailyAverages: {
            sessions: 11.4,
            sessionTime: 1520000, // milliseconds
            pageViews: 23.7
          }
        },

        // Purchase behavior analytics
        purchasing: {
          totalOrders: 23,
          totalSpent: 12485.67,
          avgOrderValue: 543.29,
          daysSinceLastPurchase: 12,

          // Purchase patterns
          preferredCategories: [
            { category: "Electronics", orderCount: 12, totalSpent: 8234.50 },
            { category: "Books", orderCount: 8, totalSpent: 2145.32 },
            { category: "Clothing", orderCount: 3, totalSpent: 2105.85 }
          ],

          // Customer lifecycle metrics  
          lifetimeValue: 12485.67,
          predictedLifetimeValue: 24750.00,
          churnProbability: 0.15,
          nextPurchasePrediction: ISODate("2024-09-28"),

          // RFM scores
          rfmScores: {
            recency: 4, // Recent purchase
            frequency: 3, // Moderate purchase frequency
            monetary: 5, // High spending
            combined: "435",
            segment: "Loyal Customer"
          }
        },

        // Content interaction metrics
        contentEngagement: {
          articlesRead: 45,
          videosWatched: 23,
          totalReadingTime: 54000000, // milliseconds (15 hours)
          avgReadingSpeed: 250, // words per minute

          // Content preferences
          preferredTopics: [
            { topic: "Technology", interactionScore: 9.2, articles: 18 },
            { topic: "Programming", interactionScore: 8.8, articles: 15 },
            { topic: "Career", interactionScore: 7.5, articles: 12 }
          ],

          // Engagement quality
          completionRate: 0.78, // 78% of articles read to completion
          shareRate: 0.12, // 12% of articles shared
          bookmarkRate: 0.25 // 25% of articles bookmarked
        },

        // Social interaction metrics
        socialMetrics: {
          connectionsCount: 156,
          followersCount: 234,
          followingCount: 189,

          // Interaction patterns
          postsCreated: 67,
          commentsPosted: 234,
          likesGiven: 1567,
          sharesGiven: 89,

          // Influence metrics
          avgLikesPerPost: 12.4,
          avgCommentsPerPost: 3.8,
          influenceScore: 7.3,
          engagementRate: 0.065 // 6.5%
        }
      },

      // Computation metadata
      computationMetadata: {
        lastComputedAt: ISODate("2024-09-14T06:00:00Z"),
        nextComputationAt: ISODate("2024-09-15T06:00:00Z"),
        computationFrequency: "daily",
        computationDuration: 2340, // milliseconds
        dataFreshness: "6_hours", // Data is 6 hours old

        // Data sources used in computation
        dataSources: [
          {
            collection: "user_sessions",
            lastProcessedRecord: ISODate("2024-09-14T00:00:00Z"),
            recordsProcessed: 342
          },
          {
            collection: "orders",
            lastProcessedRecord: ISODate("2024-09-13T23:59:59Z"),
            recordsProcessed: 23
          },
          {
            collection: "content_interactions", 
            lastProcessedRecord: ISODate("2024-09-14T00:00:00Z"),
            recordsProcessed: 1456
          }
        ],

        // Computation version for tracking changes
        version: "2.1.0",
        algorithmVersion: "analytics_v2_1"
      },

      createdAt: ISODate("2024-01-15"),
      updatedAt: ISODate("2024-09-14T06:00:00Z")
    };
  }

  // Method to choose optimal pattern based on use case
  recommendDataPattern(useCase) {
    const recommendations = {
      "user_profile": {
        pattern: "embedded",
        reason: "Related data accessed together, relatively small size",
        example: "createUserProfileEmbeddedPattern()"
      },
      "blog_system": {
        pattern: "reference",
        reason: "Large documents, many-to-many relationships, separate lifecycle",
        example: "createBlogPostReferencePattern()"
      },
      "ecommerce_order": {
        pattern: "hybrid",
        reason: "Need historical snapshots and current references",
        example: "createOrderHybridPattern()"
      },
      "notification_system": {
        pattern: "polymorphic", 
        reason: "Different document structures based on notification type",
        example: "createNotificationPolymorphicPattern()"
      },
      "time_series_data": {
        pattern: "bucket",
        reason: "High-volume data with time-based grouping",
        example: "createMetricsBucketPattern()"
      },
      "product_catalog": {
        pattern: "attribute",
        reason: "Flexible attributes with search and filtering needs",
        example: "createProductAttributePattern()"
      },
      "user_analytics": {
        pattern: "computed",
        reason: "Expensive calculations need pre-computation",
        example: "createUserAnalyticsComputedPattern()"
      }
    };

    return recommendations[useCase] || {
      pattern: "hybrid",
      reason: "Consider combining patterns based on specific requirements",
      example: "Analyze access patterns and choose appropriate combination"
    };
  }
}

Schema Design and Migration Strategies

Implement effective schema evolution and migration patterns:

// Advanced schema design and migration strategies
class MongoSchemaManager {
  constructor(db) {
    this.db = db;
    this.schemaVersions = new Map();
    this.migrationHistory = [];
  }

  async createSchemaVersioningSystem(collection) {
    // Schema versioning pattern for gradual migrations
    const schemaVersionedDocument = {
      _id: ObjectId("64a1b2c3d4e5f6789012355a"),

      // Schema version metadata
      _schema: {
        version: "2.1.0",
        createdAt: ISODate("2024-09-14"),
        lastMigrated: ISODate("2024-09-14T08:30:00Z"),
        migrationHistory: [
          {
            fromVersion: "1.0.0",
            toVersion: "2.0.0",
            migratedAt: ISODate("2024-08-15T10:00:00Z"),
            migrationId: "migration_20240815_v2",
            changes: ["Added user preferences", "Restructured contact methods"]
          },
          {
            fromVersion: "2.0.0",
            toVersion: "2.1.0",
            migratedAt: ISODate("2024-09-14T08:30:00Z"),
            migrationId: "migration_20240914_v21",
            changes: ["Added analytics tracking", "Enhanced profile structure"]
          }
        ]
      },

      // Document data with current schema structure
      username: "sarah_dev",
      email: "sarah@example.com",
      profile: {
        firstName: "Sarah",
        lastName: "Johnson",
        // ... rest of profile data
      },

      // Optional: Keep old field names for backward compatibility during transition
      _deprecated: {
        // Old structure maintained during migration period
        full_name: "Sarah Johnson", // Deprecated in v2.0.0
        user_preferences: { /* old structure */ }, // Deprecated in v2.1.0
        deprecatedFields: ["full_name", "user_preferences"],
        removalScheduled: ISODate("2024-12-01") // When to remove deprecated fields
      },

      createdAt: ISODate("2024-01-15"),
      updatedAt: ISODate("2024-09-14")
    };

    return schemaVersionedDocument;
  }

  async performGradualMigration(collection, fromVersion, toVersion, migrationConfig) {
    // Gradual migration strategy to avoid downtime
    const migrationPlan = {
      migrationId: `migration_${Date.now()}`,
      fromVersion: fromVersion,
      toVersion: toVersion,
      startedAt: new Date(),

      // Migration phases
      phases: [
        {
          phase: 1,
          name: "preparation",
          description: "Create indexes and validate migration logic",
          status: "pending"
        },
        {
          phase: 2,
          name: "gradual_migration", 
          description: "Migrate documents in batches",
          batchSize: migrationConfig.batchSize || 1000,
          status: "pending"
        },
        {
          phase: 3,
          name: "validation",
          description: "Validate migrated data integrity",
          status: "pending"
        },
        {
          phase: 4,
          name: "cleanup",
          description: "Remove deprecated fields and indexes",
          status: "pending"
        }
      ]
    };

    try {
      // Phase 1: Preparation
      console.log("Phase 1: Preparing migration...");
      migrationPlan.phases[0].status = "in_progress";

      // Create necessary indexes for migration
      if (migrationConfig.newIndexes) {
        for (const index of migrationConfig.newIndexes) {
          await this.db.collection(collection).createIndex(index.fields, index.options);
          console.log(`Created index: ${JSON.stringify(index.fields)}`);
        }
      }

      migrationPlan.phases[0].status = "completed";
      migrationPlan.phases[0].completedAt = new Date();

      // Phase 2: Gradual migration in batches
      console.log("Phase 2: Starting gradual migration...");
      migrationPlan.phases[1].status = "in_progress";
      migrationPlan.phases[1].startedAt = new Date();

      let totalProcessed = 0;
      let batchNumber = 0;

      while (true) {
        batchNumber++;

        // Find documents that need migration
        const documentsToMigrate = await this.db.collection(collection).find({
          "_schema.version": { $ne: toVersion },
          "_migrationLock": { $exists: false } // Avoid concurrent migration
        })
        .limit(migrationConfig.batchSize || 1000)
        .toArray();

        if (documentsToMigrate.length === 0) {
          break; // No more documents to migrate
        }

        console.log(`Processing batch ${batchNumber}: ${documentsToMigrate.length} documents`);

        // Process batch with write concern for durability
        const bulkOperations = [];

        for (const doc of documentsToMigrate) {
          // Set migration lock to prevent concurrent updates
          await this.db.collection(collection).updateOne(
            { _id: doc._id },
            { $set: { "_migrationLock": true } }
          );

          try {
            // Apply migration transformation
            const migratedDoc = await this.applyMigrationTransformation(doc, fromVersion, toVersion);

            bulkOperations.push({
              updateOne: {
                filter: { _id: doc._id },
                update: {
                  $set: migratedDoc,
                  $unset: { "_migrationLock": 1 },
                  $push: {
                    "_schema.migrationHistory": {
                      fromVersion: fromVersion,
                      toVersion: toVersion,
                      migratedAt: new Date(),
                      migrationId: migrationPlan.migrationId
                    }
                  }
                }
              }
            });

          } catch (error) {
            console.error(`Migration failed for document ${doc._id}:`, error);

            // Remove migration lock on failure
            await this.db.collection(collection).updateOne(
              { _id: doc._id },
              { $unset: { "_migrationLock": 1 } }
            );
          }
        }

        // Execute bulk operations
        if (bulkOperations.length > 0) {
          const result = await this.db.collection(collection).bulkWrite(bulkOperations, {
            writeConcern: { w: "majority" }
          });

          totalProcessed += result.modifiedCount;
          console.log(`Batch ${batchNumber} completed: ${result.modifiedCount} documents migrated`);
        }

        // Add delay between batches to reduce system load
        if (migrationConfig.batchDelayMs) {
          await new Promise(resolve => setTimeout(resolve, migrationConfig.batchDelayMs));
        }
      }

      migrationPlan.phases[1].status = "completed";
      migrationPlan.phases[1].completedAt = new Date();
      migrationPlan.phases[1].documentsProcessed = totalProcessed;

      // Phase 3: Validation
      console.log("Phase 3: Validating migration...");
      migrationPlan.phases[2].status = "in_progress";

      const validationResult = await this.validateMigration(collection, toVersion);

      if (validationResult.success) {
        migrationPlan.phases[2].status = "completed";
        migrationPlan.phases[2].validationResult = validationResult;
        console.log("Migration validation successful");
      } else {
        migrationPlan.phases[2].status = "failed";
        migrationPlan.phases[2].validationResult = validationResult;
        throw new Error(`Migration validation failed: ${validationResult.errors.join(", ")}`);
      }

      // Phase 4: Cleanup (optional, scheduled for later)
      if (migrationConfig.immediateCleanup) {
        console.log("Phase 4: Cleanup...");
        migrationPlan.phases[3].status = "in_progress";

        await this.cleanupDeprecatedFields(collection, migrationConfig.fieldsToRemove);

        migrationPlan.phases[3].status = "completed";
        migrationPlan.phases[3].completedAt = new Date();
      } else {
        migrationPlan.phases[3].status = "scheduled";
        migrationPlan.phases[3].scheduledFor = migrationConfig.cleanupScheduledFor;
      }

      migrationPlan.status = "completed";
      migrationPlan.completedAt = new Date();

      // Record migration in history
      this.migrationHistory.push(migrationPlan);

      return migrationPlan;

    } catch (error) {
      migrationPlan.status = "failed";
      migrationPlan.error = error.message;
      migrationPlan.failedAt = new Date();

      console.error("Migration failed:", error);

      // Attempt to clean up any migration locks
      await this.db.collection(collection).updateMany(
        { "_migrationLock": true },
        { $unset: { "_migrationLock": 1 } }
      );

      throw error;
    }
  }

  async applyMigrationTransformation(document, fromVersion, toVersion) {
    // Apply specific transformation based on version upgrade path
    const transformations = {
      "1.0.0_to_2.0.0": (doc) => {
        // Example: Restructure user contact information
        if (doc.full_name && !doc.profile) {
          const nameParts = doc.full_name.split(" ");
          doc.profile = {
            firstName: nameParts[0] || "",
            lastName: nameParts.slice(1).join(" ") || ""
          };

          // Mark old field as deprecated but keep for backward compatibility
          doc._deprecated = doc._deprecated || {};
          doc._deprecated.full_name = doc.full_name;
          delete doc.full_name;
        }

        // Update schema version
        doc._schema = doc._schema || {};
        doc._schema.version = "2.0.0";
        doc._schema.lastMigrated = new Date();

        return doc;
      },

      "2.0.0_to_2.1.0": (doc) => {
        // Example: Add analytics tracking structure
        if (!doc.analytics) {
          doc.analytics = {
            totalLogins: 0,
            lastLoginAt: null,
            createdAt: doc.createdAt,
            engagement: {
              level: "new",
              score: 0
            }
          };
        }

        // Migrate user preferences structure
        if (doc.user_preferences && !doc.preferences) {
          doc.preferences = {
            theme: doc.user_preferences.theme || "light",
            language: doc.user_preferences.lang || "en",
            notifications: doc.user_preferences.notifications || {}
          };

          // Mark old field as deprecated
          doc._deprecated = doc._deprecated || {};
          doc._deprecated.user_preferences = doc.user_preferences;
          delete doc.user_preferences;
        }

        // Update schema version
        doc._schema.version = "2.1.0";
        doc._schema.lastMigrated = new Date();

        return doc;
      }
    };

    const transformationKey = `${fromVersion}_to_${toVersion}`;
    const transformation = transformations[transformationKey];

    if (!transformation) {
      throw new Error(`No transformation defined for ${transformationKey}`);
    }

    return transformation({ ...document }); // Work with copy to avoid mutations
  }

  async validateMigration(collection, expectedVersion) {
    const validationResult = {
      success: true,
      errors: [],
      warnings: [],
      statistics: {}
    };

    try {
      // Check all documents have the correct schema version
      const totalDocuments = await this.db.collection(collection).countDocuments({});
      const migratedDocuments = await this.db.collection(collection).countDocuments({
        "_schema.version": expectedVersion
      });

      validationResult.statistics.totalDocuments = totalDocuments;
      validationResult.statistics.migratedDocuments = migratedDocuments;
      validationResult.statistics.migrationCompleteness = migratedDocuments / totalDocuments;

      if (migratedDocuments !== totalDocuments) {
        validationResult.errors.push(
          `Migration incomplete: ${migratedDocuments}/${totalDocuments} documents migrated`
        );
        validationResult.success = false;
      }

      // Check for migration locks (indicates failed migrations)
      const lockedDocuments = await this.db.collection(collection).countDocuments({
        "_migrationLock": true
      });

      if (lockedDocuments > 0) {
        validationResult.warnings.push(
          `${lockedDocuments} documents have migration locks - may indicate failed migrations`
        );
      }

      // Validate sample documents have expected structure
      const sampleSize = Math.min(100, migratedDocuments);
      const sampleDocuments = await this.db.collection(collection).aggregate([
        { $match: { "_schema.version": expectedVersion } },
        { $sample: { size: sampleSize } }
      ]).toArray();

      let structureValidationErrors = 0;

      for (const doc of sampleDocuments) {
        try {
          await this.validateDocumentStructure(doc, expectedVersion);
        } catch (error) {
          structureValidationErrors++;
        }
      }

      if (structureValidationErrors > 0) {
        validationResult.errors.push(
          `${structureValidationErrors}/${sampleSize} sample documents have structure validation errors`
        );
        validationResult.success = false;
      }

      validationResult.statistics.sampleSize = sampleSize;
      validationResult.statistics.structureValidationErrors = structureValidationErrors;

    } catch (error) {
      validationResult.success = false;
      validationResult.errors.push(`Validation error: ${error.message}`);
    }

    return validationResult;
  }

  async validateDocumentStructure(document, schemaVersion) {
    // Define expected structure for each schema version
    const schemaValidators = {
      "2.1.0": (doc) => {
        // Required fields for version 2.1.0
        const requiredFields = ["_schema", "username", "email", "profile", "createdAt"];

        for (const field of requiredFields) {
          if (!doc.hasOwnProperty(field)) {
            throw new Error(`Missing required field: ${field}`);
          }
        }

        // Validate _schema structure
        if (!doc._schema.version || !doc._schema.lastMigrated) {
          throw new Error("Invalid _schema structure");
        }

        // Validate profile structure
        if (!doc.profile.firstName || !doc.profile.lastName) {
          throw new Error("Invalid profile structure");
        }

        return true;
      }
    };

    const validator = schemaValidators[schemaVersion];
    if (!validator) {
      throw new Error(`No validator defined for schema version ${schemaVersion}`);
    }

    return validator(document);
  }

  async cleanupDeprecatedFields(collection, fieldsToRemove) {
    // Remove deprecated fields after successful migration
    console.log(`Cleaning up deprecated fields: ${fieldsToRemove.join(", ")}`);

    const unsetFields = fieldsToRemove.reduce((acc, field) => {
      acc[field] = 1;
      acc[`_deprecated.${field}`] = 1;
      return acc;
    }, {});

    const result = await this.db.collection(collection).updateMany(
      {}, // Update all documents
      {
        $unset: unsetFields,
        $set: {
          "cleanupCompletedAt": new Date()
        }
      }
    );

    console.log(`Cleanup completed: ${result.modifiedCount} documents updated`);
    return result;
  }

  async createSchemaValidationRules(collection, schemaVersion) {
    // Create MongoDB schema validation rules
    const validationRules = {
      "2.1.0": {
        $jsonSchema: {
          bsonType: "object",
          required: ["_schema", "username", "email", "profile", "createdAt"],
          properties: {
            _schema: {
              bsonType: "object",
              required: ["version"],
              properties: {
                version: {
                  bsonType: "string",
                  enum: ["2.1.0"]
                },
                lastMigrated: {
                  bsonType: "date"
                }
              }
            },
            username: {
              bsonType: "string",
              minLength: 3,
              maxLength: 30
            },
            email: {
              bsonType: "string",
              pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
            },
            profile: {
              bsonType: "object",
              required: ["firstName", "lastName"],
              properties: {
                firstName: { bsonType: "string", maxLength: 50 },
                lastName: { bsonType: "string", maxLength: 50 },
                dateOfBirth: { bsonType: "date" },
                avatar: {
                  bsonType: "object",
                  properties: {
                    url: { bsonType: "string" },
                    uploadedAt: { bsonType: "date" }
                  }
                }
              }
            },
            createdAt: { bsonType: "date" },
            updatedAt: { bsonType: "date" }
          }
        }
      }
    };

    const rule = validationRules[schemaVersion];
    if (!rule) {
      throw new Error(`No validation rule defined for schema version ${schemaVersion}`);
    }

    // Apply validation rule to collection
    await this.db.command({
      collMod: collection,
      validator: rule,
      validationLevel: "moderate", // Only validate inserts and updates to valid documents
      validationAction: "warn" // Log validation errors but allow operations
    });

    console.log(`Schema validation rules applied to ${collection} for version ${schemaVersion}`);
    return rule;
  }

  async getMigrationStatus(collection) {
    // Get comprehensive migration status for a collection
    const status = {
      collection: collection,
      currentTime: new Date(),
      schemaVersions: {},
      totalDocuments: 0,
      migrationLocks: 0,
      deprecatedFields: [],
      recentMigrations: []
    };

    // Count documents by schema version
    const versionCounts = await this.db.collection(collection).aggregate([
      {
        $group: {
          _id: "$_schema.version",
          count: { $sum: 1 },
          lastMigrated: { $max: "$_schema.lastMigrated" }
        }
      },
      { $sort: { "_id": 1 } }
    ]).toArray();

    versionCounts.forEach(version => {
      status.schemaVersions[version._id || "unknown"] = {
        count: version.count,
        lastMigrated: version.lastMigrated
      };
      status.totalDocuments += version.count;
    });

    // Count migration locks
    status.migrationLocks = await this.db.collection(collection).countDocuments({
      "_migrationLock": true
    });

    // Find documents with deprecated fields
    const deprecatedFieldsAnalysis = await this.db.collection(collection).aggregate([
      { $match: { "_deprecated": { $exists: true } } },
      {
        $project: {
          deprecatedFields: { $objectToArray: "$_deprecated" }
        }
      },
      { $unwind: "$deprecatedFields" },
      {
        $group: {
          _id: "$deprecatedFields.k",
          count: { $sum: 1 }
        }
      }
    ]).toArray();

    status.deprecatedFields = deprecatedFieldsAnalysis.map(field => ({
      fieldName: field._id,
      documentCount: field.count
    }));

    // Get recent migration history
    status.recentMigrations = this.migrationHistory
      .filter(migration => migration.collection === collection)
      .slice(-5) // Last 5 migrations
      .map(migration => ({
        migrationId: migration.migrationId,
        fromVersion: migration.fromVersion,
        toVersion: migration.toVersion,
        status: migration.status,
        completedAt: migration.completedAt,
        documentsProcessed: migration.phases[1]?.documentsProcessed
      }));

    return status;
  }
}

SQL-Style Data Modeling with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB data modeling and schema design:

-- QueryLeaf data modeling with SQL-familiar schema design syntax

-- Define document structure similar to CREATE TABLE
CREATE DOCUMENT_SCHEMA users (
  _id OBJECTID PRIMARY KEY,
  username VARCHAR(30) NOT NULL UNIQUE,
  email VARCHAR(255) NOT NULL UNIQUE,

  -- Embedded document structure
  profile DOCUMENT {
    firstName VARCHAR(50) NOT NULL,
    lastName VARCHAR(50) NOT NULL,
    dateOfBirth DATE,
    avatar DOCUMENT {
      url VARCHAR(500),
      uploadedAt TIMESTAMP,
      size DOCUMENT {
        width INTEGER,
        height INTEGER
      }
    },
    bio TEXT,
    location DOCUMENT {
      city VARCHAR(100),
      state VARCHAR(50),
      country VARCHAR(100),
      timezone VARCHAR(50)
    }
  },

  -- Array of embedded documents
  contactMethods ARRAY OF DOCUMENT {
    type ENUM('email', 'phone', 'address'),
    value VARCHAR(255) NOT NULL,
    isPrimary BOOLEAN DEFAULT false,
    isVerified BOOLEAN DEFAULT false,
    verifiedAt TIMESTAMP
  },

  -- Array of simple values with constraints
  skills ARRAY OF DOCUMENT {
    name VARCHAR(100) NOT NULL,
    level ENUM('beginner', 'intermediate', 'advanced', 'expert'),
    yearsExperience INTEGER CHECK (yearsExperience >= 0)
  },

  -- Reference to another collection
  departmentId OBJECTID REFERENCES departments(_id),

  -- Embedded metadata
  account DOCUMENT {
    status ENUM('active', 'inactive', 'suspended') DEFAULT 'active',
    type ENUM('free', 'premium', 'enterprise') DEFAULT 'free',
    createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    lastLoginAt TIMESTAMP,
    loginCount INTEGER DEFAULT 0,
    isEmailVerified BOOLEAN DEFAULT false,
    twoFactorEnabled BOOLEAN DEFAULT false
  },

  -- Flexible attributes using attribute pattern
  attributes ARRAY OF DOCUMENT {
    key VARCHAR(100) NOT NULL,
    value MIXED, -- Can be string, number, boolean, etc.
    type ENUM('string', 'number', 'boolean', 'date'),
    isSearchable BOOLEAN DEFAULT false,
    isFilterable BOOLEAN DEFAULT false,
    category VARCHAR(50)
  },

  -- Timestamps for auditing
  createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create indexes for optimal query performance
CREATE INDEX idx_users_username ON users (username);
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX idx_users_profile_name ON users (profile.firstName, profile.lastName);
CREATE INDEX idx_users_skills ON users (skills.name, skills.level);
CREATE INDEX idx_users_location ON users (profile.location.city, profile.location.state);

-- Compound index for complex queries
CREATE INDEX idx_users_active_premium ON users (account.status, account.type, createdAt);

-- Text index for full-text search
CREATE TEXT INDEX idx_users_search ON users (
  username,
  profile.firstName,
  profile.lastName,
  profile.bio,
  skills.name
);

-- Schema versioning and migration management
ALTER DOCUMENT_SCHEMA users ADD COLUMN analytics DOCUMENT {
  totalLogins INTEGER DEFAULT 0,
  lastLoginAt TIMESTAMP,
  engagement DOCUMENT {
    level ENUM('new', 'active', 'power', 'inactive') DEFAULT 'new',
    score DECIMAL(3,2) DEFAULT 0.00
  }
} WITH MIGRATION_STRATEGY gradual;

-- Polymorphic document schema for notifications
CREATE DOCUMENT_SCHEMA notifications (
  _id OBJECTID PRIMARY KEY,
  userId OBJECTID NOT NULL REFERENCES users(_id),
  type ENUM('email', 'push', 'sms') NOT NULL,

  -- Common fields for all notification types
  title VARCHAR(200) NOT NULL,
  priority ENUM('low', 'normal', 'high', 'urgent') DEFAULT 'normal',
  status ENUM('pending', 'sent', 'delivered', 'failed') DEFAULT 'pending',

  -- Polymorphic data based on type using VARIANT
  notificationData VARIANT {
    WHEN type = 'email' THEN DOCUMENT {
      from VARCHAR(255) NOT NULL,
      to VARCHAR(255) NOT NULL,
      subject VARCHAR(500) NOT NULL,
      templateId VARCHAR(100),
      templateVariables DOCUMENT,
      deliveryAttempts INTEGER DEFAULT 0,
      deliveredAt TIMESTAMP,
      openedAt TIMESTAMP,
      clickedAt TIMESTAMP
    },

    WHEN type = 'push' THEN DOCUMENT {
      deviceTokens ARRAY OF VARCHAR(255),
      payload DOCUMENT {
        alert DOCUMENT {
          title VARCHAR(200),
          body VARCHAR(500)
        },
        badge INTEGER,
        sound VARCHAR(50),
        category VARCHAR(100),
        customData DOCUMENT
      },
      deliveryResults ARRAY OF DOCUMENT {
        deviceToken VARCHAR(255),
        status ENUM('delivered', 'failed'),
        error VARCHAR(255),
        timestamp TIMESTAMP
      }
    },

    WHEN type = 'sms' THEN DOCUMENT {
      to VARCHAR(20) NOT NULL,
      from VARCHAR(20),
      message VARCHAR(1600) NOT NULL,
      provider VARCHAR(50),
      messageId VARCHAR(255),
      segments INTEGER DEFAULT 1,
      cost DECIMAL(6,4),
      deliveredAt TIMESTAMP,
      deliveryStatus VARCHAR(50)
    }
  },

  createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Bucket pattern for time-series metrics
CREATE DOCUMENT_SCHEMA user_activity_buckets (
  _id OBJECTID PRIMARY KEY,

  -- Bucket identification
  userId OBJECTID NOT NULL REFERENCES users(_id),
  bucketDate TIMESTAMP NOT NULL, -- Hour/day bucket start time
  bucketType ENUM('hourly', 'daily') NOT NULL,

  -- Bucket metadata
  metadata DOCUMENT {
    userName VARCHAR(30),
    userSegment VARCHAR(50),
    deviceType VARCHAR(50),
    location VARCHAR(100)
  },

  -- Event counter
  eventCount INTEGER DEFAULT 0,

  -- Array of events within the bucket
  events ARRAY OF DOCUMENT {
    timestamp TIMESTAMP NOT NULL,
    eventType ENUM('page_view', 'click', 'api_call', 'error') NOT NULL,
    page VARCHAR(500),
    element VARCHAR(200),
    sessionId VARCHAR(100),
    responseTime INTEGER,
    statusCode INTEGER,
    userAgent TEXT
  } VALIDATE (ARRAY_LENGTH(events) <= 1000), -- Limit bucket size

  -- Pre-computed summary statistics
  summary DOCUMENT {
    pageViews INTEGER DEFAULT 0,
    clicks INTEGER DEFAULT 0,
    apiCalls INTEGER DEFAULT 0,
    errors INTEGER DEFAULT 0,
    uniquePages INTEGER DEFAULT 0,
    totalResponseTime BIGINT DEFAULT 0,
    avgResponseTime DECIMAL(8,2),
    maxResponseTime INTEGER,
    minResponseTime INTEGER
  },

  -- Bucket management
  bucketMetadata DOCUMENT {
    isFull BOOLEAN DEFAULT false,
    maxEvents INTEGER DEFAULT 1000,
    nextBucketId OBJECTID
  },

  createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Compound index for efficient bucket queries
CREATE INDEX idx_activity_buckets_user_time ON user_activity_buckets (
  userId, bucketType, bucketDate
);

-- Complex analytics queries with document modeling
WITH user_engagement AS (
  SELECT 
    u._id as user_id,
    u.username,
    u.profile.firstName || ' ' || u.profile.lastName as full_name,
    u.account.type as account_type,

    -- Aggregate metrics from activity buckets
    SUM(ab.summary.pageViews) as total_page_views,
    SUM(ab.summary.clicks) as total_clicks,
    AVG(ab.summary.avgResponseTime) as avg_response_time,
    COUNT(DISTINCT ab.bucketDate) as active_days,

    -- Calculate engagement score
    (SUM(ab.summary.pageViews) * 0.1 + 
     SUM(ab.summary.clicks) * 0.3 + 
     COUNT(DISTINCT ab.bucketDate) * 0.6) as engagement_score,

    -- User profile attributes
    ARRAY_AGG(
      CASE WHEN ua.attributes->key = 'department' 
           THEN ua.attributes->value 
      END
    ) FILTER (WHERE ua.attributes->key = 'department') as departments,

    -- Location information
    u.profile.location.city as city,
    u.profile.location.state as state

  FROM users u
  LEFT JOIN user_activity_buckets ab ON u._id = ab.userId
    AND ab.bucketDate >= CURRENT_DATE - INTERVAL '30 days'
  LEFT JOIN UNNEST(u.attributes) as ua ON true

  WHERE u.account.status = 'active'
    AND u.createdAt >= CURRENT_DATE - INTERVAL '1 year'

  GROUP BY u._id, u.username, u.profile.firstName, u.profile.lastName,
           u.account.type, u.profile.location.city, u.profile.location.state
),

engagement_segments AS (
  SELECT *,
    CASE 
      WHEN engagement_score >= 50 THEN 'High Engagement'
      WHEN engagement_score >= 20 THEN 'Medium Engagement' 
      WHEN engagement_score >= 5 THEN 'Low Engagement'
      ELSE 'Inactive'
    END as engagement_segment,

    -- Percentile ranking within account type
    PERCENT_RANK() OVER (
      PARTITION BY account_type 
      ORDER BY engagement_score
    ) as engagement_percentile

  FROM user_engagement
)

SELECT 
  engagement_segment,
  account_type,
  COUNT(*) as user_count,
  AVG(engagement_score) as avg_engagement_score,
  AVG(total_page_views) as avg_page_views,
  AVG(active_days) as avg_active_days,

  -- Top cities by user count in each segment
  ARRAY_AGG(
    JSON_BUILD_OBJECT(
      'city', city,
      'state', state,
      'count', COUNT(*) OVER (PARTITION BY city, state)
    ) ORDER BY COUNT(*) OVER (PARTITION BY city, state) DESC LIMIT 5
  ) as top_locations,

  -- Engagement distribution
  JSON_BUILD_OBJECT(
    'min', MIN(engagement_score),
    'p25', PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY engagement_score),
    'p50', PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY engagement_score),
    'p75', PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY engagement_score),
    'p95', PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY engagement_score),
    'max', MAX(engagement_score)
  ) as engagement_distribution

FROM engagement_segments
GROUP BY engagement_segment, account_type
ORDER BY engagement_segment, account_type;

-- Schema validation and data quality checks
SELECT 
  collection_name,
  schema_version,
  document_count,

  -- Data quality metrics
  (SELECT COUNT(*) FROM users WHERE username IS NULL) as missing_usernames,
  (SELECT COUNT(*) FROM users WHERE email IS NULL) as missing_emails,
  (SELECT COUNT(*) FROM users WHERE profile IS NULL) as missing_profiles,

  -- Schema compliance
  (SELECT COUNT(*) FROM users WHERE _schema.version != '2.1.0') as outdated_schema,
  (SELECT COUNT(*) FROM users WHERE _migrationLock = true) as migration_locks,

  -- Index usage analysis
  JSON_BUILD_OBJECT(
    'username_index_usage', INDEX_USAGE_STATS('users', 'idx_users_username'),
    'email_index_usage', INDEX_USAGE_STATS('users', 'idx_users_email'),
    'profile_name_index_usage', INDEX_USAGE_STATS('users', 'idx_users_profile_name')
  ) as index_statistics,

  -- Storage efficiency metrics
  AVG_DOCUMENT_SIZE('users') as avg_document_size_kb,
  DOCUMENT_SIZE_DISTRIBUTION('users') as size_distribution,

  CURRENT_TIMESTAMP as analysis_timestamp

FROM DOCUMENT_SCHEMA_STATS('users');

-- Migration management with SQL-style syntax
CREATE MIGRATION migrate_users_v2_to_v3 AS
BEGIN
  -- Add new analytics structure
  ALTER DOCUMENT_SCHEMA users 
  ADD COLUMN detailed_analytics DOCUMENT {
    sessions ARRAY OF DOCUMENT {
      sessionId VARCHAR(100),
      startTime TIMESTAMP,
      endTime TIMESTAMP,
      pageViews INTEGER,
      actions ARRAY OF VARCHAR(100)
    },
    preferences DOCUMENT {
      communicationChannels ARRAY OF ENUM('email', 'sms', 'push'),
      contentTopics ARRAY OF VARCHAR(100),
      frequencySettings DOCUMENT {
        marketing ENUM('never', 'weekly', 'monthly'),
        updates ENUM('immediate', 'daily', 'weekly')
      }
    }
  };

  -- Update existing documents with default values
  UPDATE users 
  SET detailed_analytics = {
    sessions: [],
    preferences: {
      communicationChannels: ['email'],
      contentTopics: [],
      frequencySettings: {
        marketing: 'monthly',
        updates: 'weekly'
      }
    }
  }
  WHERE detailed_analytics IS NULL;

  -- Update schema version
  UPDATE users 
  SET 
    _schema.version = '3.0.0',
    _schema.lastMigrated = CURRENT_TIMESTAMP,
    updatedAt = CURRENT_TIMESTAMP;

END;

-- Execute migration with options
EXECUTE MIGRATION migrate_users_v2_to_v3 WITH OPTIONS (
  batch_size = 1000,
  batch_delay_ms = 100,
  validation_sample_size = 50,
  cleanup_schedule = '2024-12-01'
);

-- Monitor migration progress
SELECT 
  migration_name,
  status,
  current_phase,
  documents_processed,
  estimated_completion,
  error_count,
  last_error_message
FROM MIGRATION_STATUS('migrate_users_v2_to_v3');

-- QueryLeaf data modeling provides:
-- 1. SQL-familiar schema definition with document structure support
-- 2. Flexible embedded documents and arrays with validation
-- 3. Polymorphic schemas with variant types based on discriminator fields
-- 4. Advanced indexing strategies for document queries
-- 5. Schema versioning and gradual migration management
-- 6. Data quality validation and compliance checking
-- 7. Storage efficiency analysis and optimization recommendations
-- 8. Integration with MongoDB's native document features
-- 9. SQL-style complex queries across embedded structures
-- 10. Automated migration execution with rollback capabilities

Best Practices for MongoDB Data Modeling

Design Decision Framework

Strategic approach to document design decisions:

Access Pattern Analysis: Design documents based on how data will be queried and updated
Cardinality Considerations: Choose embedding vs. referencing based on relationship cardinality
Data Growth Patterns: Consider how document size and collection size will grow over time
Update Frequency: Factor in how often different parts of documents will be updated
Consistency Requirements: Balance performance with data consistency needs
Query Performance: Optimize document structure for most common query patterns

Performance Optimization Guidelines

Essential practices for high-performance document modeling:

Document Size Management: Keep documents under 16MB limit, optimize for working set
Index Strategy: Create indexes that support your access patterns and query requirements
Denormalization Strategy: Strategic denormalization for read performance vs. update complexity
Array Size Limits: Monitor array growth to prevent performance degradation
Embedding Depth: Limit nesting levels to maintain query performance and readability
Schema Evolution: Plan for schema changes without downtime using versioning strategies

Conclusion

MongoDB data modeling requires a fundamental shift from relational thinking to document-oriented design principles. By understanding when to embed versus reference data, how to structure documents for optimal performance, and how to implement effective schema evolution strategies, you can create database designs that are both flexible and performant.

Key data modeling benefits include:

Flexible Schema Design: Documents can evolve naturally with application requirements
Optimal Performance: Strategic embedding eliminates complex joins for read-heavy workloads
Natural Data Structures: Document structure aligns with object-oriented programming models
Horizontal Scalability: Document design supports sharding and distributed architectures
Rich Data Types: Native support for arrays, nested objects, and complex data structures
Schema Evolution: Gradual migration strategies enable schema changes without downtime

Whether you're building content management systems, e-commerce platforms, real-time analytics applications, or any system requiring flexible data structures, MongoDB's document modeling with QueryLeaf's familiar SQL interface provides the foundation for scalable, maintainable database designs. This combination enables you to leverage advanced NoSQL capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar schema definitions into optimal MongoDB document structures while providing familiar syntax for complex document queries, schema evolution, and migration management. Advanced document patterns, validation rules, and performance optimization are seamlessly handled through SQL-style operations, making flexible schema design both powerful and accessible.

The integration of flexible document modeling with SQL-style database operations makes MongoDB an ideal platform for applications requiring both sophisticated data structures and familiar database interaction patterns, ensuring your data models remain both efficient and maintainable as they scale and evolve.

September 13, 2025
22 min read

MongoDB Connection Pooling and Performance Optimization: SQL-Style Database Connection Management for High-Throughput Applications

Modern applications require efficient database connection management to handle hundreds or thousands of concurrent users while maintaining optimal performance and resource utilization. Traditional approaches of creating individual database connections for each request quickly exhaust system resources and create performance bottlenecks that severely impact application scalability and user experience.

MongoDB connection pooling provides sophisticated connection management that maintains a pool of persistent database connections, automatically handling connection lifecycle, load balancing, failover scenarios, and performance optimization. Unlike simple connection-per-request models, connection pooling delivers predictable performance characteristics, efficient resource utilization, and robust error handling for production-scale applications.

The Database Connection Challenge

Traditional database connection approaches create significant scalability and performance issues:

-- Traditional per-request connection approach - inefficient and unscalable
-- Each request creates a new database connection
public class TraditionalDatabaseAccess {
    public List<User> getUsers(String filter) throws SQLException {
        // Create new connection for each request
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/appdb",
            "username", "password"
        );

        try {
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT user_id, username, email, created_at " +
                "FROM users WHERE status = ? ORDER BY created_at DESC LIMIT 100"
            );
            stmt.setString(1, filter);

            ResultSet rs = stmt.executeQuery();
            List<User> users = new ArrayList<>();

            while (rs.next()) {
                users.add(new User(
                    rs.getInt("user_id"),
                    rs.getString("username"), 
                    rs.getString("email"),
                    rs.getTimestamp("created_at")
                ));
            }

            return users;

        } finally {
            // Close connection after each request
            conn.close(); // Expensive cleanup for each request
        }
    }
}

-- Problems with per-request connections:
-- 1. Connection establishment overhead (100-500ms per connection)
-- 2. Resource exhaustion under high concurrent load
-- 3. Database server connection limits exceeded quickly
-- 4. TCP socket exhaustion on application servers
-- 5. Unpredictable performance due to connection timing
-- 6. No connection reuse or optimization
-- 7. Difficult to implement failover and retry logic
-- 8. Memory leaks from improperly closed connections

-- Basic connection pooling attempt - still problematic
public class BasicConnectionPool {
    private static final int MAX_CONNECTIONS = 100;
    private Queue<Connection> availableConnections = new LinkedList<>();
    private Set<Connection> usedConnections = new HashSet<>();

    public Connection getConnection() throws SQLException {
        synchronized (this) {
            if (availableConnections.isEmpty()) {
                if (usedConnections.size() < MAX_CONNECTIONS) {
                    availableConnections.add(createNewConnection());
                } else {
                    throw new SQLException("Connection pool exhausted");
                }
            }

            Connection conn = availableConnections.poll();
            usedConnections.add(conn);
            return conn;
        }
    }

    public void releaseConnection(Connection conn) {
        synchronized (this) {
            usedConnections.remove(conn);
            availableConnections.offer(conn);
        }
    }
}

-- Problems with basic pooling:
-- - No connection validation or health checking
-- - No automatic recovery from stale connections
-- - Poor load balancing across multiple database servers
-- - No monitoring or performance metrics
-- - Synchronization bottlenecks under high concurrency
-- - No graceful handling of connection failures
-- - Fixed pool size regardless of actual demand
-- - No integration with application lifecycle management

MongoDB connection pooling with sophisticated management provides comprehensive solutions:

// MongoDB advanced connection pooling - production-ready performance optimization
const { MongoClient, ServerApiVersion } = require('mongodb');

// Advanced connection pool configuration
const mongoClient = new MongoClient('mongodb://localhost:27017/production_db', {
  // Connection pool settings
  maxPoolSize: 100,          // Maximum connections in pool
  minPoolSize: 5,           // Minimum connections to maintain
  maxIdleTimeMS: 300000,    // Close connections after 5 minutes idle

  // Performance optimization
  maxConnecting: 2,         // Max concurrent connection attempts
  connectTimeoutMS: 10000,  // 10 second connection timeout
  socketTimeoutMS: 30000,   // 30 second socket timeout

  // High availability settings
  serverSelectionTimeoutMS: 5000,  // Server selection timeout
  heartbeatFrequencyMS: 10000,     // Health check frequency
  retryWrites: true,               // Automatic retry for write operations
  retryReads: true,                // Automatic retry for read operations

  // Read preference for load balancing
  readPreference: 'secondaryPreferred',
  readConcern: { level: 'majority' },

  // Write concern for durability
  writeConcern: { 
    w: 'majority', 
    j: true,
    wtimeoutMS: 10000 
  },

  // Compression for network efficiency
  compressors: ['zstd', 'zlib', 'snappy'],

  // Server API version for compatibility
  serverApi: {
    version: ServerApiVersion.v1,
    strict: true,
    deprecationErrors: true
  }
});

// Efficient database operations with connection pooling
async function getUsersWithPooling(filter, limit = 100) {
  try {
    // Connection automatically obtained from pool
    const db = mongoClient.db('production_db');
    const users = await db.collection('users').find({
      status: filter,
      deletedAt: { $exists: false }
    })
    .sort({ createdAt: -1 })
    .limit(limit)
    .toArray();

    return {
      users: users,
      count: users.length,
      requestTime: new Date(),
      // Connection automatically returned to pool
      poolStats: await getConnectionPoolStats()
    };

  } catch (error) {
    console.error('Database operation failed:', error);
    // Connection pool handles error recovery automatically
    throw error;
  }
  // No explicit connection cleanup needed - pool manages lifecycle
}

// Benefits of MongoDB connection pooling:
// - Automatic connection lifecycle management
// - Optimal resource utilization with min/max pool sizing
// - Built-in health monitoring and connection validation
// - Automatic failover and recovery handling  
// - Load balancing across replica set members
// - Intelligent connection reuse and optimization
// - Performance monitoring and metrics collection
// - Thread-safe operations without synchronization overhead
// - Graceful handling of network interruptions and timeouts
// - Integration with MongoDB driver performance features

Understanding MongoDB Connection Pool Management

Advanced Connection Pool Configuration and Monitoring

Implement sophisticated connection pool management for production environments:

// Comprehensive connection pool management system
class MongoConnectionPoolManager {
  constructor(config = {}) {
    this.config = {
      // Connection pool configuration
      maxPoolSize: config.maxPoolSize || 100,
      minPoolSize: config.minPoolSize || 5,
      maxIdleTimeMS: config.maxIdleTimeMS || 300000,
      maxConnecting: config.maxConnecting || 2,

      // Performance settings
      connectTimeoutMS: config.connectTimeoutMS || 10000,
      socketTimeoutMS: config.socketTimeoutMS || 30000,
      serverSelectionTimeoutMS: config.serverSelectionTimeoutMS || 5000,
      heartbeatFrequencyMS: config.heartbeatFrequencyMS || 10000,

      // High availability
      retryWrites: config.retryWrites !== false,
      retryReads: config.retryReads !== false,
      readPreference: config.readPreference || 'secondaryPreferred',

      // Monitoring
      enableMonitoring: config.enableMonitoring !== false,
      monitoringInterval: config.monitoringInterval || 30000,

      ...config
    };

    this.clients = new Map();
    this.poolMetrics = new Map();
    this.monitoringInterval = null;
    this.eventListeners = new Map();
  }

  async createClient(connectionString, databaseName, clientOptions = {}) {
    const clientConfig = {
      maxPoolSize: this.config.maxPoolSize,
      minPoolSize: this.config.minPoolSize,
      maxIdleTimeMS: this.config.maxIdleTimeMS,
      maxConnecting: this.config.maxConnecting,
      connectTimeoutMS: this.config.connectTimeoutMS,
      socketTimeoutMS: this.config.socketTimeoutMS,
      serverSelectionTimeoutMS: this.config.serverSelectionTimeoutMS,
      heartbeatFrequencyMS: this.config.heartbeatFrequencyMS,
      retryWrites: this.config.retryWrites,
      retryReads: this.config.retryReads,
      readPreference: this.config.readPreference,
      readConcern: { level: 'majority' },
      writeConcern: { 
        w: 'majority', 
        j: true,
        wtimeoutMS: 10000 
      },
      compressors: ['zstd', 'zlib', 'snappy'],
      serverApi: {
        version: ServerApiVersion.v1,
        strict: true,
        deprecationErrors: true
      },
      ...clientOptions
    };

    const client = new MongoClient(connectionString, clientConfig);

    // Set up connection pool event monitoring
    this.setupPoolEventListeners(client, databaseName);

    // Connect and validate
    await client.connect();

    // Store client reference
    this.clients.set(databaseName, {
      client: client,
      db: client.db(databaseName),
      connectionString: connectionString,
      config: clientConfig,
      createdAt: new Date(),
      lastUsed: new Date(),
      operationCount: 0,
      errorCount: 0
    });

    // Initialize metrics tracking
    this.poolMetrics.set(databaseName, {
      connectionsCreated: 0,
      connectionsDestroyed: 0,
      operationsExecuted: 0,
      operationErrors: 0,
      avgOperationTime: 0,
      poolSizeHistory: [],
      errorHistory: [],
      performanceMetrics: {
        p50ResponseTime: 0,
        p95ResponseTime: 0,
        p99ResponseTime: 0,
        errorRate: 0
      }
    });

    console.log(`MongoDB client created for database: ${databaseName}`);

    if (this.config.enableMonitoring) {
      this.startMonitoring();
    }

    return this.clients.get(databaseName);
  }

  setupPoolEventListeners(client, databaseName) {
    // Connection pool created
    client.on('connectionPoolCreated', (event) => {
      console.log(`Connection pool created for ${databaseName}:`, {
        address: event.address,
        options: event.options
      });

      if (this.poolMetrics.has(databaseName)) {
        this.poolMetrics.get(databaseName).poolCreatedAt = new Date();
      }
    });

    // Connection created
    client.on('connectionCreated', (event) => {
      console.log(`New connection created for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address
      });

      const metrics = this.poolMetrics.get(databaseName);
      if (metrics) {
        metrics.connectionsCreated++;
      }
    });

    // Connection ready
    client.on('connectionReady', (event) => {
      console.log(`Connection ready for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address
      });
    });

    // Connection closed
    client.on('connectionClosed', (event) => {
      console.log(`Connection closed for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address,
        reason: event.reason
      });

      const metrics = this.poolMetrics.get(databaseName);
      if (metrics) {
        metrics.connectionsDestroyed++;
      }
    });

    // Connection check out started
    client.on('connectionCheckOutStarted', (event) => {
      // Track connection pool usage patterns
      const metrics = this.poolMetrics.get(databaseName);
      if (metrics) {
        metrics.checkoutStartTime = Date.now();
      }
    });

    // Connection checked out
    client.on('connectionCheckedOut', (event) => {
      console.log(`Connection checked out for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address
      });

      const metrics = this.poolMetrics.get(databaseName);
      if (metrics && metrics.checkoutStartTime) {
        const checkoutTime = Date.now() - metrics.checkoutStartTime;
        metrics.avgCheckoutTime = (metrics.avgCheckoutTime || 0) * 0.9 + checkoutTime * 0.1;
      }
    });

    // Connection checked in
    client.on('connectionCheckedIn', (event) => {
      console.log(`Connection checked in for ${databaseName}:`, {
        connectionId: event.connectionId,
        address: event.address
      });
    });

    // Connection pool cleared
    client.on('connectionPoolCleared', (event) => {
      console.warn(`Connection pool cleared for ${databaseName}:`, {
        address: event.address,
        interruptInUseConnections: event.interruptInUseConnections
      });
    });

    // Server selection events
    client.on('serverOpening', (event) => {
      console.log(`Server opening for ${databaseName}:`, {
        address: event.address,
        topologyId: event.topologyId
      });
    });

    client.on('serverClosed', (event) => {
      console.log(`Server closed for ${databaseName}:`, {
        address: event.address,
        topologyId: event.topologyId
      });
    });

    client.on('serverDescriptionChanged', (event) => {
      console.log(`Server description changed for ${databaseName}:`, {
        address: event.address,
        newDescription: event.newDescription.type,
        previousDescription: event.previousDescription.type
      });
    });

    // Topology events
    client.on('topologyOpening', (event) => {
      console.log(`Topology opening for ${databaseName}:`, {
        topologyId: event.topologyId
      });
    });

    client.on('topologyClosed', (event) => {
      console.log(`Topology closed for ${databaseName}:`, {
        topologyId: event.topologyId
      });
    });

    client.on('topologyDescriptionChanged', (event) => {
      console.log(`Topology description changed for ${databaseName}:`, {
        topologyId: event.topologyId,
        newDescription: event.newDescription.type,
        previousDescription: event.previousDescription.type
      });
    });

    // Command monitoring for performance tracking
    client.on('commandStarted', (event) => {
      const clientInfo = this.clients.get(databaseName);
      if (clientInfo) {
        clientInfo.lastCommandStart = Date.now();
        clientInfo.lastCommand = {
          commandName: event.commandName,
          requestId: event.requestId,
          databaseName: event.databaseName
        };
      }
    });

    client.on('commandSucceeded', (event) => {
      const clientInfo = this.clients.get(databaseName);
      const metrics = this.poolMetrics.get(databaseName);

      if (clientInfo && metrics) {
        const duration = event.duration || (Date.now() - clientInfo.lastCommandStart);

        // Update operation metrics
        metrics.operationsExecuted++;
        metrics.avgOperationTime = (metrics.avgOperationTime * 0.95) + (duration * 0.05);

        clientInfo.operationCount++;
        clientInfo.lastUsed = new Date();

        // Track performance percentiles (simplified)
        this.updatePerformanceMetrics(databaseName, duration, true);
      }
    });

    client.on('commandFailed', (event) => {
      console.error(`Command failed for ${databaseName}:`, {
        commandName: event.commandName,
        failure: event.failure.message,
        duration: event.duration
      });

      const clientInfo = this.clients.get(databaseName);
      const metrics = this.poolMetrics.get(databaseName);

      if (clientInfo && metrics) {
        clientInfo.errorCount++;
        metrics.operationErrors++;

        metrics.errorHistory.push({
          timestamp: new Date(),
          command: event.commandName,
          error: event.failure.message,
          duration: event.duration
        });

        // Keep only recent error history
        if (metrics.errorHistory.length > 100) {
          metrics.errorHistory.shift();
        }

        this.updatePerformanceMetrics(databaseName, event.duration, false);
      }
    });
  }

  updatePerformanceMetrics(databaseName, duration, success) {
    const metrics = this.poolMetrics.get(databaseName);
    if (!metrics) return;

    // Simple sliding window for performance metrics
    if (!metrics.responseTimeWindow) {
      metrics.responseTimeWindow = [];
    }

    metrics.responseTimeWindow.push({
      timestamp: Date.now(),
      duration: duration,
      success: success
    });

    // Keep only last 1000 operations
    if (metrics.responseTimeWindow.length > 1000) {
      metrics.responseTimeWindow.shift();
    }

    // Calculate percentiles (simplified)
    const successfulOperations = metrics.responseTimeWindow
      .filter(op => op.success)
      .map(op => op.duration)
      .sort((a, b) => a - b);

    if (successfulOperations.length > 0) {
      const p50Index = Math.floor(successfulOperations.length * 0.5);
      const p95Index = Math.floor(successfulOperations.length * 0.95);
      const p99Index = Math.floor(successfulOperations.length * 0.99);

      metrics.performanceMetrics.p50ResponseTime = successfulOperations[p50Index] || 0;
      metrics.performanceMetrics.p95ResponseTime = successfulOperations[p95Index] || 0;
      metrics.performanceMetrics.p99ResponseTime = successfulOperations[p99Index] || 0;
    }

    // Calculate error rate
    const recentOperations = metrics.responseTimeWindow.filter(
      op => Date.now() - op.timestamp < 300000 // Last 5 minutes
    );

    if (recentOperations.length > 0) {
      const errorCount = recentOperations.filter(op => !op.success).length;
      metrics.performanceMetrics.errorRate = (errorCount / recentOperations.length) * 100;
    }
  }

  async getClient(databaseName) {
    const clientInfo = this.clients.get(databaseName);
    if (!clientInfo) {
      throw new Error(`No client found for database: ${databaseName}`);
    }

    // Check client health
    try {
      await clientInfo.client.db('admin').admin().ping();
      clientInfo.lastUsed = new Date();
      return clientInfo;
    } catch (error) {
      console.error(`Client health check failed for ${databaseName}:`, error);

      // Attempt to reconnect
      try {
        await this.reconnectClient(databaseName);
        return this.clients.get(databaseName);
      } catch (reconnectError) {
        console.error(`Reconnection failed for ${databaseName}:`, reconnectError);
        throw reconnectError;
      }
    }
  }

  async reconnectClient(databaseName) {
    const clientInfo = this.clients.get(databaseName);
    if (!clientInfo) {
      throw new Error(`No client configuration found for database: ${databaseName}`);
    }

    console.log(`Attempting to reconnect client for ${databaseName}...`);

    try {
      // Close existing client
      await clientInfo.client.close();
    } catch (closeError) {
      console.warn(`Error closing existing client: ${closeError.message}`);
    }

    // Create new client with existing configuration
    await this.createClient(
      clientInfo.connectionString,
      databaseName,
      clientInfo.config
    );

    console.log(`Successfully reconnected client for ${databaseName}`);
  }

  async executeWithPool(databaseName, operation) {
    const startTime = Date.now();
    let success = true;

    try {
      const clientInfo = await this.getClient(databaseName);
      const result = await operation(clientInfo.db, clientInfo.client);

      return result;

    } catch (error) {
      success = false;
      console.error(`Operation failed for ${databaseName}:`, error);
      throw error;

    } finally {
      const duration = Date.now() - startTime;
      this.updatePerformanceMetrics(databaseName, duration, success);
    }
  }

  async getConnectionPoolStats(databaseName) {
    if (!databaseName) {
      // Return stats for all databases
      const allStats = {};
      for (const [dbName, clientInfo] of this.clients.entries()) {
        allStats[dbName] = await this.getSingleDatabaseStats(dbName, clientInfo);
      }
      return allStats;
    }

    const clientInfo = this.clients.get(databaseName);
    if (!clientInfo) {
      throw new Error(`No client found for database: ${databaseName}`);
    }

    return await this.getSingleDatabaseStats(databaseName, clientInfo);
  }

  async getSingleDatabaseStats(databaseName, clientInfo) {
    const metrics = this.poolMetrics.get(databaseName);

    try {
      // Get current server status
      const serverStatus = await clientInfo.client.db('admin').admin().serverStatus();
      const connectionPoolStats = serverStatus.connections || {};

      return {
        database: databaseName,

        // Basic connection info
        connectionString: clientInfo.connectionString.replace(/\/\/.*@/, '//***@'), // Hide credentials
        createdAt: clientInfo.createdAt,
        lastUsed: clientInfo.lastUsed,

        // Pool configuration
        poolConfig: {
          maxPoolSize: this.config.maxPoolSize,
          minPoolSize: this.config.minPoolSize,
          maxIdleTimeMS: this.config.maxIdleTimeMS,
          maxConnecting: this.config.maxConnecting
        },

        // Current pool status
        poolStatus: {
          current: connectionPoolStats.current || 0,
          available: connectionPoolStats.available || 0,
          active: connectionPoolStats.active || 0,
          totalCreated: connectionPoolStats.totalCreated || 0
        },

        // Operation metrics
        operations: {
          totalOperations: clientInfo.operationCount,
          totalErrors: clientInfo.errorCount,
          errorRate: clientInfo.operationCount > 0 ? 
            ((clientInfo.errorCount / clientInfo.operationCount) * 100).toFixed(2) + '%' : '0%'
        },

        // Performance metrics
        performance: metrics ? {
          avgOperationTime: Math.round(metrics.avgOperationTime || 0) + 'ms',
          p50ResponseTime: Math.round(metrics.performanceMetrics.p50ResponseTime || 0) + 'ms',
          p95ResponseTime: Math.round(metrics.performanceMetrics.p95ResponseTime || 0) + 'ms',
          p99ResponseTime: Math.round(metrics.performanceMetrics.p99ResponseTime || 0) + 'ms',
          currentErrorRate: (metrics.performanceMetrics.errorRate || 0).toFixed(2) + '%',
          avgCheckoutTime: Math.round(metrics.avgCheckoutTime || 0) + 'ms'
        } : null,

        // Historical data
        history: metrics ? {
          connectionsCreated: metrics.connectionsCreated,
          connectionsDestroyed: metrics.connectionsDestroyed,
          operationsExecuted: metrics.operationsExecuted,
          operationErrors: metrics.operationErrors,
          recentErrors: metrics.errorHistory.slice(-5) // Last 5 errors
        } : null,

        // Health assessment
        health: this.assessConnectionHealth(clientInfo, metrics, connectionPoolStats),

        statsGeneratedAt: new Date()
      };

    } catch (error) {
      console.error(`Error getting stats for ${databaseName}:`, error);

      return {
        database: databaseName,
        error: error.message,
        lastKnownGoodStats: {
          createdAt: clientInfo.createdAt,
          lastUsed: clientInfo.lastUsed,
          operationCount: clientInfo.operationCount,
          errorCount: clientInfo.errorCount
        },
        statsGeneratedAt: new Date()
      };
    }
  }

  assessConnectionHealth(clientInfo, metrics, connectionPoolStats) {
    const health = {
      overall: 'healthy',
      issues: [],
      recommendations: []
    };

    // Check error rate
    if (clientInfo.operationCount > 0) {
      const errorRate = (clientInfo.errorCount / clientInfo.operationCount) * 100;
      if (errorRate > 10) {
        health.issues.push(`High error rate: ${errorRate.toFixed(2)}%`);
        health.overall = 'unhealthy';
      } else if (errorRate > 5) {
        health.issues.push(`Elevated error rate: ${errorRate.toFixed(2)}%`);
        health.overall = 'warning';
      }
    }

    // Check connection pool utilization
    const poolUtilization = connectionPoolStats.current / this.config.maxPoolSize;
    if (poolUtilization > 0.9) {
      health.issues.push(`High pool utilization: ${(poolUtilization * 100).toFixed(1)}%`);
      health.recommendations.push('Consider increasing maxPoolSize');
      if (health.overall === 'healthy') health.overall = 'warning';
    }

    // Check average response time
    if (metrics && metrics.avgOperationTime > 5000) {
      health.issues.push(`High average response time: ${metrics.avgOperationTime.toFixed(0)}ms`);
      health.recommendations.push('Investigate query performance and indexing');
      if (health.overall === 'healthy') health.overall = 'warning';
    }

    // Check recent errors
    if (metrics && metrics.errorHistory.length > 0) {
      const recentErrors = metrics.errorHistory.filter(
        error => Date.now() - error.timestamp.getTime() < 300000 // Last 5 minutes
      );

      if (recentErrors.length > 5) {
        health.issues.push(`Multiple recent errors: ${recentErrors.length} in last 5 minutes`);
        health.recommendations.push('Check application logs and network connectivity');
        health.overall = 'unhealthy';
      }
    }

    // Check last usage
    const timeSinceLastUse = Date.now() - clientInfo.lastUsed.getTime();
    if (timeSinceLastUse > 3600000) { // 1 hour
      health.issues.push(`Client unused for ${Math.round(timeSinceLastUse / 60000)} minutes`);
      health.recommendations.push('Consider closing idle connections');
    }

    return health;
  }

  startMonitoring() {
    if (this.monitoringInterval) {
      return; // Already monitoring
    }

    console.log(`Starting connection pool monitoring (interval: ${this.config.monitoringInterval}ms)`);

    this.monitoringInterval = setInterval(async () => {
      try {
        await this.performMonitoringCheck();
      } catch (error) {
        console.error('Monitoring check failed:', error);
      }
    }, this.config.monitoringInterval);
  }

  async performMonitoringCheck() {
    for (const [databaseName, clientInfo] of this.clients.entries()) {
      try {
        const stats = await this.getSingleDatabaseStats(databaseName, clientInfo);

        // Log health issues
        if (stats.health && stats.health.overall !== 'healthy') {
          console.warn(`Health check for ${databaseName}:`, {
            status: stats.health.overall,
            issues: stats.health.issues,
            recommendations: stats.health.recommendations
          });
        }

        // Store historical pool size data
        const metrics = this.poolMetrics.get(databaseName);
        if (metrics && stats.poolStatus) {
          metrics.poolSizeHistory.push({
            timestamp: new Date(),
            current: stats.poolStatus.current,
            available: stats.poolStatus.available,
            active: stats.poolStatus.active
          });

          // Keep only last 24 hours of pool size history
          const oneDayAgo = Date.now() - 24 * 60 * 60 * 1000;
          metrics.poolSizeHistory = metrics.poolSizeHistory.filter(
            entry => entry.timestamp.getTime() > oneDayAgo
          );
        }

        // Emit monitoring event if listeners are registered
        if (this.eventListeners.has('monitoring_check')) {
          this.eventListeners.get('monitoring_check').forEach(listener => {
            listener(databaseName, stats);
          });
        }

      } catch (error) {
        console.error(`Monitoring check failed for ${databaseName}:`, error);
      }
    }
  }

  stopMonitoring() {
    if (this.monitoringInterval) {
      clearInterval(this.monitoringInterval);
      this.monitoringInterval = null;
      console.log('Connection pool monitoring stopped');
    }
  }

  addEventListener(eventName, listener) {
    if (!this.eventListeners.has(eventName)) {
      this.eventListeners.set(eventName, []);
    }
    this.eventListeners.get(eventName).push(listener);
  }

  removeEventListener(eventName, listener) {
    if (this.eventListeners.has(eventName)) {
      const listeners = this.eventListeners.get(eventName);
      const index = listeners.indexOf(listener);
      if (index > -1) {
        listeners.splice(index, 1);
      }
    }
  }

  async closeClient(databaseName) {
    const clientInfo = this.clients.get(databaseName);
    if (!clientInfo) {
      console.warn(`No client found for database: ${databaseName}`);
      return;
    }

    try {
      await clientInfo.client.close();
      this.clients.delete(databaseName);
      this.poolMetrics.delete(databaseName);
      console.log(`Client closed for database: ${databaseName}`);
    } catch (error) {
      console.error(`Error closing client for ${databaseName}:`, error);
    }
  }

  async closeAllClients() {
    const closePromises = [];

    for (const databaseName of this.clients.keys()) {
      closePromises.push(this.closeClient(databaseName));
    }

    this.stopMonitoring();

    await Promise.all(closePromises);
    console.log('All MongoDB clients closed');
  }
}

High-Performance Connection Pool Patterns

Implement specialized connection pool patterns for different application scenarios:

// Specialized connection pool patterns for different use cases
class SpecializedConnectionPools {
  constructor() {
    this.poolManager = new MongoConnectionPoolManager();
    this.pools = new Map();
  }

  async createReadWritePools(config) {
    // Separate connection pools for read and write operations
    const writePoolConfig = {
      maxPoolSize: config.writeMaxPool || 50,
      minPoolSize: config.writeMinPool || 5,
      readPreference: 'primary',
      writeConcern: { w: 'majority', j: true },
      readConcern: { level: 'majority' },
      retryWrites: true,
      heartbeatFrequencyMS: 5000
    };

    const readPoolConfig = {
      maxPoolSize: config.readMaxPool || 100,
      minPoolSize: config.readMinPool || 10,
      readPreference: 'secondaryPreferred',
      readConcern: { level: 'available' }, // Faster reads
      retryReads: true,
      heartbeatFrequencyMS: 10000,
      maxIdleTimeMS: 600000 // Keep read connections longer
    };

    // Create separate clients for read and write
    const writeClient = await this.poolManager.createClient(
      config.connectionString,
      `${config.databaseName}_write`,
      writePoolConfig
    );

    const readClient = await this.poolManager.createClient(
      config.connectionString,
      `${config.databaseName}_read`,
      readPoolConfig
    );

    this.pools.set(`${config.databaseName}_readwrite`, {
      writeClient: writeClient,
      readClient: readClient,
      createdAt: new Date()
    });

    return {
      writeClient: writeClient,
      readClient: readClient,

      // Convenience methods
      executeWrite: (operation) => this.poolManager.executeWithPool(
        `${config.databaseName}_write`, operation
      ),
      executeRead: (operation) => this.poolManager.executeWithPool(
        `${config.databaseName}_read`, operation
      )
    };
  }

  async createTenantAwarePools(tenantConfigs) {
    // Multi-tenant connection pooling with per-tenant isolation
    const tenantPools = new Map();

    for (const tenantConfig of tenantConfigs) {
      const tenantId = tenantConfig.tenantId;
      const poolConfig = {
        maxPoolSize: tenantConfig.maxPool || 20,
        minPoolSize: tenantConfig.minPool || 2,
        maxIdleTimeMS: tenantConfig.idleTimeout || 300000,

        // Tenant-specific settings
        appName: `app_tenant_${tenantId}`,
        authSource: tenantConfig.authDatabase || 'admin',

        // Resource limits per tenant
        serverSelectionTimeoutMS: 5000,
        connectTimeoutMS: 10000
      };

      const client = await this.poolManager.createClient(
        tenantConfig.connectionString,
        `tenant_${tenantId}`,
        poolConfig
      );

      tenantPools.set(tenantId, {
        client: client,
        config: tenantConfig,
        createdAt: new Date(),
        lastUsed: new Date(),
        operationCount: 0
      });
    }

    this.pools.set('tenant_pools', tenantPools);

    return {
      executeForTenant: async (tenantId, operation) => {
        const tenantPool = tenantPools.get(tenantId);
        if (!tenantPool) {
          throw new Error(`No pool configured for tenant: ${tenantId}`);
        }

        tenantPool.lastUsed = new Date();
        tenantPool.operationCount++;

        return await this.poolManager.executeWithPool(
          `tenant_${tenantId}`,
          operation
        );
      },

      getTenantStats: async (tenantId) => {
        if (tenantId) {
          return await this.poolManager.getConnectionPoolStats(`tenant_${tenantId}`);
        } else {
          // Return stats for all tenants
          const allStats = {};
          for (const [tId, poolInfo] of tenantPools.entries()) {
            allStats[tId] = await this.poolManager.getConnectionPoolStats(`tenant_${tId}`);
          }
          return allStats;
        }
      }
    };
  }

  async createGeographicPools(regionConfigs) {
    // Geographic connection pools for global applications
    const regionPools = new Map();

    for (const regionConfig of regionConfigs) {
      const region = regionConfig.region;
      const poolConfig = {
        maxPoolSize: regionConfig.maxPool || 75,
        minPoolSize: regionConfig.minPool || 10,

        // Region-specific optimizations
        connectTimeoutMS: regionConfig.connectTimeout || 15000,
        serverSelectionTimeoutMS: regionConfig.selectionTimeout || 10000,
        heartbeatFrequencyMS: regionConfig.heartbeatFreq || 10000,

        // Compression for long-distance connections
        compressors: ['zstd', 'zlib'],

        // Read preference based on region
        readPreference: regionConfig.readPreference || 'nearest',

        appName: `app_${region}`
      };

      const client = await this.poolManager.createClient(
        regionConfig.connectionString,
        `region_${region}`,
        poolConfig
      );

      regionPools.set(region, {
        client: client,
        config: regionConfig,
        createdAt: new Date(),
        lastUsed: new Date(),
        latencyMetrics: {
          avgLatency: 0,
          minLatency: Number.MAX_VALUE,
          maxLatency: 0,
          measurements: []
        }
      });
    }

    this.pools.set('region_pools', regionPools);

    return {
      executeInRegion: async (region, operation) => {
        const regionPool = regionPools.get(region);
        if (!regionPool) {
          throw new Error(`No pool configured for region: ${region}`);
        }

        const startTime = Date.now();

        try {
          const result = await this.poolManager.executeWithPool(
            `region_${region}`,
            operation
          );

          // Track latency metrics
          const latency = Date.now() - startTime;
          this.updateRegionLatencyMetrics(region, latency);

          regionPool.lastUsed = new Date();
          return result;

        } catch (error) {
          const latency = Date.now() - startTime;
          this.updateRegionLatencyMetrics(region, latency);
          throw error;
        }
      },

      selectOptimalRegion: async (preferredRegions = []) => {
        // Select region with best performance characteristics
        let bestRegion = null;
        let bestScore = -1;

        for (const region of preferredRegions.length > 0 ? preferredRegions : regionPools.keys()) {
          const regionPool = regionPools.get(region);
          if (!regionPool) continue;

          const stats = await this.poolManager.getConnectionPoolStats(`region_${region}`);
          const latencyMetrics = regionPool.latencyMetrics;

          // Calculate performance score (lower latency + higher availability)
          let score = 100;
          score -= Math.min(latencyMetrics.avgLatency / 10, 50); // Latency penalty
          score -= (parseFloat(stats.operations.errorRate) || 0); // Error rate penalty

          if (stats.health.overall === 'unhealthy') score -= 30;
          else if (stats.health.overall === 'warning') score -= 15;

          if (score > bestScore) {
            bestScore = score;
            bestRegion = region;
          }
        }

        return {
          region: bestRegion,
          score: bestScore,
          metrics: bestRegion ? regionPools.get(bestRegion).latencyMetrics : null
        };
      }
    };
  }

  updateRegionLatencyMetrics(region, latency) {
    const regionPools = this.pools.get('region_pools');
    const regionPool = regionPools?.get(region);

    if (regionPool) {
      const metrics = regionPool.latencyMetrics;

      // Update latency statistics
      metrics.measurements.push({
        timestamp: Date.now(),
        latency: latency
      });

      // Keep only recent measurements (last 1000)
      if (metrics.measurements.length > 1000) {
        metrics.measurements.shift();
      }

      // Calculate running averages
      const recentMeasurements = metrics.measurements.slice(-100); // Last 100 measurements
      metrics.avgLatency = recentMeasurements.reduce((sum, m) => sum + m.latency, 0) / recentMeasurements.length;
      metrics.minLatency = Math.min(metrics.minLatency, latency);
      metrics.maxLatency = Math.max(metrics.maxLatency, latency);
    }
  }

  async createPriorityPools(priorityConfig) {
    // Priority-based connection pooling for different service levels
    const priorityLevels = ['critical', 'high', 'normal', 'low'];
    const priorityPools = new Map();

    for (const priority of priorityLevels) {
      const config = priorityConfig[priority] || {};
      const poolConfig = {
        maxPoolSize: config.maxPool || this.getDefaultPoolSize(priority),
        minPoolSize: config.minPool || this.getDefaultMinPool(priority),
        maxIdleTimeMS: config.idleTimeout || this.getDefaultIdleTimeout(priority),

        // Priority-specific timeouts
        connectTimeoutMS: config.connectTimeout || this.getDefaultConnectTimeout(priority),
        socketTimeoutMS: config.socketTimeout || this.getDefaultSocketTimeout(priority),
        serverSelectionTimeoutMS: config.selectionTimeout || this.getDefaultSelectionTimeout(priority),

        // Quality of service settings
        readConcern: { level: priority === 'critical' ? 'majority' : 'available' },
        writeConcern: priority === 'critical' ? 
          { w: 'majority', j: true, wtimeout: 10000 } : 
          { w: 1, wtimeout: 5000 },

        appName: `app_priority_${priority}`
      };

      const client = await this.poolManager.createClient(
        priorityConfig.connectionString,
        `priority_${priority}`,
        poolConfig
      );

      priorityPools.set(priority, {
        client: client,
        priority: priority,
        config: poolConfig,
        createdAt: new Date(),
        queuedOperations: 0,
        completedOperations: 0,
        rejectedOperations: 0
      });
    }

    this.pools.set('priority_pools', priorityPools);

    return {
      executeWithPriority: async (priority, operation, options = {}) => {
        const priorityPool = priorityPools.get(priority);
        if (!priorityPool) {
          throw new Error(`No pool configured for priority: ${priority}`);
        }

        // Check if pool is overloaded and priority allows rejection
        if (this.shouldRejectLowPriorityOperation(priority, priorityPool)) {
          priorityPool.rejectedOperations++;
          throw new Error(`Operation rejected due to high load (priority: ${priority})`);
        }

        priorityPool.queuedOperations++;

        try {
          const result = await this.poolManager.executeWithPool(
            `priority_${priority}`,
            operation
          );

          priorityPool.completedOperations++;
          priorityPool.queuedOperations--;

          return result;

        } catch (error) {
          priorityPool.queuedOperations--;
          throw error;
        }
      },

      getPriorityStats: async () => {
        const stats = {};

        for (const [priority, poolInfo] of priorityPools.entries()) {
          const poolStats = await this.poolManager.getConnectionPoolStats(`priority_${priority}`);

          stats[priority] = {
            ...poolStats,
            queuedOperations: poolInfo.queuedOperations,
            completedOperations: poolInfo.completedOperations,
            rejectedOperations: poolInfo.rejectedOperations,
            rejectionRate: poolInfo.completedOperations > 0 ? 
              ((poolInfo.rejectedOperations / (poolInfo.completedOperations + poolInfo.rejectedOperations)) * 100).toFixed(2) + '%' : 
              '0%'
          };
        }

        return stats;
      },

      adjustPriorityLimits: async (priority, newLimits) => {
        // Dynamic adjustment of priority pool limits
        const priorityPool = priorityPools.get(priority);
        if (priorityPool) {
          // This would require reconnecting with new pool settings
          console.log(`Adjusting limits for priority ${priority}:`, newLimits);
          // Implementation would depend on specific requirements
        }
      }
    };
  }

  getDefaultPoolSize(priority) {
    const sizes = {
      critical: 100,
      high: 75,
      normal: 50,
      low: 25
    };
    return sizes[priority] || 50;
  }

  getDefaultMinPool(priority) {
    const sizes = {
      critical: 10,
      high: 8,
      normal: 5,
      low: 2
    };
    return sizes[priority] || 5;
  }

  getDefaultIdleTimeout(priority) {
    const timeouts = {
      critical: 60000,   // 1 minute
      high: 120000,      // 2 minutes  
      normal: 300000,    // 5 minutes
      low: 600000        // 10 minutes
    };
    return timeouts[priority] || 300000;
  }

  getDefaultConnectTimeout(priority) {
    const timeouts = {
      critical: 5000,   // 5 seconds
      high: 8000,       // 8 seconds
      normal: 10000,    // 10 seconds
      low: 15000        // 15 seconds
    };
    return timeouts[priority] || 10000;
  }

  getDefaultSocketTimeout(priority) {
    const timeouts = {
      critical: 10000,  // 10 seconds
      high: 20000,      // 20 seconds
      normal: 30000,    // 30 seconds
      low: 60000        // 60 seconds
    };
    return timeouts[priority] || 30000;
  }

  getDefaultSelectionTimeout(priority) {
    const timeouts = {
      critical: 3000,   // 3 seconds
      high: 5000,       // 5 seconds
      normal: 8000,     // 8 seconds
      low: 15000        // 15 seconds
    };
    return timeouts[priority] || 8000;
  }

  shouldRejectLowPriorityOperation(priority, priorityPool) {
    // Simple load-based rejection for low priority operations
    if (priority === 'low' && priorityPool.queuedOperations > 10) {
      return true;
    }

    if (priority === 'normal' && priorityPool.queuedOperations > 25) {
      return true;
    }

    return false;
  }

  async createBatchProcessingPool(config) {
    // Specialized pool for batch processing operations
    const batchPoolConfig = {
      maxPoolSize: config.maxPool || 200,
      minPoolSize: config.minPool || 20,
      maxConnecting: config.maxConnecting || 10,

      // Longer timeouts for batch operations
      connectTimeoutMS: config.connectTimeout || 30000,
      socketTimeoutMS: config.socketTimeout || 120000,
      serverSelectionTimeoutMS: config.selectionTimeout || 15000,

      // Optimized for bulk operations
      maxIdleTimeMS: config.idleTimeout || 1800000, // 30 minutes
      heartbeatFrequencyMS: config.heartbeatFreq || 30000,

      // Bulk operation settings
      readPreference: 'secondary',
      readConcern: { level: 'available' },
      writeConcern: { w: 1 }, // Faster writes for bulk operations

      // Compression for large data transfers
      compressors: ['zstd', 'zlib'],

      appName: 'batch_processor'
    };

    const client = await this.poolManager.createClient(
      config.connectionString,
      'batch_processing',
      batchPoolConfig
    );

    this.pools.set('batch_pool', {
      client: client,
      config: batchPoolConfig,
      createdAt: new Date(),
      batchesProcessed: 0,
      documentsProcessed: 0,
      avgBatchSize: 0
    });

    return {
      executeBatch: async (operation, batchSize = 1000) => {
        const batchPool = this.pools.get('batch_pool');
        const startTime = Date.now();

        try {
          const result = await this.poolManager.executeWithPool(
            'batch_processing',
            async (db, client) => {
              // Configure batch operation settings
              const options = {
                ordered: false,        // Allow partial success
                bypassDocumentValidation: true, // Skip validation for performance
                writeConcern: { w: 1 } // Fast acknowledgment
              };

              return await operation(db, client, options);
            }
          );

          // Update batch statistics
          batchPool.batchesProcessed++;
          batchPool.documentsProcessed += batchSize;
          batchPool.avgBatchSize = (batchPool.avgBatchSize * 0.9) + (batchSize * 0.1);

          const duration = Date.now() - startTime;
          console.log(`Batch operation completed: ${batchSize} documents in ${duration}ms`);

          return result;

        } catch (error) {
          console.error('Batch operation failed:', error);
          throw error;
        }
      },

      getBatchStats: async () => {
        const poolStats = await this.poolManager.getConnectionPoolStats('batch_processing');
        const batchPool = this.pools.get('batch_pool');

        return {
          ...poolStats,
          batchStatistics: {
            batchesProcessed: batchPool.batchesProcessed,
            documentsProcessed: batchPool.documentsProcessed,
            avgBatchSize: Math.round(batchPool.avgBatchSize),
            documentsPerBatch: batchPool.batchesProcessed > 0 ? 
              Math.round(batchPool.documentsProcessed / batchPool.batchesProcessed) : 0
          }
        };
      }
    };
  }

  async getOverallStats() {
    // Get comprehensive statistics across all pool types
    const overallStats = {
      pools: {},
      summary: {
        totalPools: 0,
        totalConnections: 0,
        totalOperations: 0,
        overallHealth: 'healthy',
        generatedAt: new Date()
      }
    };

    // Get stats from pool manager
    const allPoolStats = await this.poolManager.getConnectionPoolStats();

    for (const [poolName, stats] of Object.entries(allPoolStats)) {
      overallStats.pools[poolName] = stats;
      overallStats.summary.totalPools++;
      overallStats.summary.totalConnections += stats.poolStatus?.current || 0;
      overallStats.summary.totalOperations += stats.operations?.totalOperations || 0;

      // Aggregate health status
      if (stats.health?.overall === 'unhealthy') {
        overallStats.summary.overallHealth = 'unhealthy';
      } else if (stats.health?.overall === 'warning' && overallStats.summary.overallHealth !== 'unhealthy') {
        overallStats.summary.overallHealth = 'warning';
      }
    }

    return overallStats;
  }

  async shutdown() {
    // Graceful shutdown of all pools
    console.log('Shutting down all connection pools...');

    await this.poolManager.closeAllClients();
    this.pools.clear();

    console.log('All connection pools shut down successfully');
  }
}

SQL-Style Connection Pool Management with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB connection pool configuration and monitoring:

-- QueryLeaf connection pool management with SQL-familiar syntax

-- Configure connection pool settings
SET CONNECTION_POOL_OPTIONS = JSON_BUILD_OBJECT(
  'maxPoolSize', 100,
  'minPoolSize', 5,
  'maxIdleTimeMS', 300000,
  'maxConnecting', 2,
  'connectTimeoutMS', 10000,
  'socketTimeoutMS', 30000,
  'serverSelectionTimeoutMS', 5000,
  'heartbeatFrequencyMS', 10000,
  'retryWrites', true,
  'retryReads', true,
  'readPreference', 'secondaryPreferred',
  'writeConcern', JSON_BUILD_OBJECT('w', 'majority', 'j', true),
  'compressors', ARRAY['zstd', 'zlib', 'snappy']
);

-- Create specialized connection pools for different workloads
CREATE CONNECTION_POOL read_pool 
WITH (
  maxPoolSize = 150,
  minPoolSize = 10,
  readPreference = 'secondaryPreferred',
  readConcern = JSON_BUILD_OBJECT('level', 'available'),
  maxIdleTimeMS = 600000
);

CREATE CONNECTION_POOL write_pool
WITH (
  maxPoolSize = 75,
  minPoolSize = 5,
  readPreference = 'primary',
  writeConcern = JSON_BUILD_OBJECT('w', 'majority', 'j', true),
  retryWrites = true,
  maxIdleTimeMS = 300000
);

CREATE CONNECTION_POOL batch_pool
WITH (
  maxPoolSize = 200,
  minPoolSize = 20,
  maxConnecting = 10,
  socketTimeoutMS = 120000,
  maxIdleTimeMS = 1800000,
  compressors = ARRAY['zstd', 'zlib']
);

-- Monitor connection pool performance and health
SELECT 
  CONNECTION_POOL_NAME() as pool_name,
  CONNECTION_POOL_MAX_SIZE() as max_connections,
  CONNECTION_POOL_CURRENT_SIZE() as current_connections,
  CONNECTION_POOL_AVAILABLE() as available_connections,
  CONNECTION_POOL_ACTIVE() as active_connections,

  -- Utilization metrics
  ROUND((CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE()) * 100, 2) as pool_utilization_pct,
  ROUND((CONNECTION_POOL_ACTIVE()::float / CONNECTION_POOL_CURRENT_SIZE()) * 100, 2) as connection_active_pct,

  -- Performance metrics
  CONNECTION_POOL_TOTAL_CREATED() as total_connections_created,
  CONNECTION_POOL_TOTAL_DESTROYED() as total_connections_destroyed,
  CONNECTION_POOL_AVG_CHECKOUT_TIME() as avg_checkout_time_ms,
  CONNECTION_POOL_OPERATION_COUNT() as total_operations,
  CONNECTION_POOL_ERROR_COUNT() as total_errors,
  ROUND((CONNECTION_POOL_ERROR_COUNT()::float / CONNECTION_POOL_OPERATION_COUNT()) * 100, 2) as error_rate_pct,

  -- Health assessment
  CASE 
    WHEN CONNECTION_POOL_ERROR_COUNT()::float / CONNECTION_POOL_OPERATION_COUNT() > 0.1 THEN 'UNHEALTHY'
    WHEN CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE() > 0.9 THEN 'WARNING'
    WHEN CONNECTION_POOL_AVG_CHECKOUT_TIME() > 1000 THEN 'WARNING'
    ELSE 'HEALTHY'
  END as health_status,

  CONNECTION_POOL_LAST_USED() as last_used,
  CONNECTION_POOL_CREATED_AT() as created_at

FROM CONNECTION_POOLS()
ORDER BY pool_utilization_pct DESC;

-- High-performance database operations using connection pools
-- Read operations using read pool
SELECT @read_pool := USE_CONNECTION_POOL('read_pool');

WITH user_analytics AS (
  SELECT 
    u.user_id,
    u.username,
    u.email,
    u.created_at,
    u.last_login,
    u.subscription_type,

    -- Calculate user engagement metrics
    COUNT(a.activity_id) as total_activities,
    MAX(a.activity_date) as last_activity,
    AVG(a.session_duration) as avg_session_duration,
    SUM(a.page_views) as total_page_views,

    -- User value calculation  
    COUNT(DISTINCT o.order_id) as total_orders,
    SUM(o.order_total) as lifetime_value,
    AVG(o.order_total) as avg_order_value

  FROM users u
  LEFT JOIN user_activities a ON u.user_id = a.user_id 
    AND a.activity_date >= CURRENT_DATE - INTERVAL '90 days'
  LEFT JOIN orders o ON u.user_id = o.customer_id
    AND o.status = 'completed'

  WHERE u.status = 'active'
    AND u.created_at >= CURRENT_DATE - INTERVAL '1 year'

  GROUP BY u.user_id, u.username, u.email, u.created_at, u.last_login, u.subscription_type
)
SELECT 
  user_id,
  username,
  email,
  subscription_type,
  total_activities,
  last_activity,
  ROUND(avg_session_duration / 60, 2) as avg_session_minutes,
  total_page_views,
  total_orders,
  COALESCE(lifetime_value, 0) as lifetime_value,
  ROUND(COALESCE(avg_order_value, 0), 2) as avg_order_value,

  -- User segmentation
  CASE 
    WHEN total_orders >= 10 AND lifetime_value >= 1000 THEN 'VIP'
    WHEN total_orders >= 5 AND lifetime_value >= 500 THEN 'LOYAL'  
    WHEN total_orders >= 1 THEN 'CUSTOMER'
    WHEN total_activities >= 10 THEN 'ENGAGED'
    ELSE 'NEW'
  END as user_segment,

  -- Engagement score
  ROUND(
    (COALESCE(total_activities, 0) * 0.3) + 
    (COALESCE(total_orders, 0) * 0.4) + 
    (LEAST(COALESCE(total_page_views, 0) / 100, 10) * 0.3), 
    2
  ) as engagement_score

FROM user_analytics
WHERE total_activities > 0 OR total_orders > 0
ORDER BY engagement_score DESC, lifetime_value DESC
LIMIT 1000;

-- Write operations using write pool
SELECT @write_pool := USE_CONNECTION_POOL('write_pool');

-- Bulk insert with optimized connection pool
INSERT INTO user_events (
  user_id,
  event_type,
  event_data,
  session_id,
  timestamp,
  ip_address,
  user_agent
)
SELECT 
  user_session.user_id,
  event_batch.event_type,
  event_batch.event_data,
  user_session.session_id,
  event_batch.timestamp,
  user_session.ip_address,
  user_session.user_agent
FROM UNNEST(@event_batch_array) AS event_batch(event_type, event_data, timestamp, user_id)
JOIN user_sessions user_session ON event_batch.user_id = user_session.user_id
WHERE user_session.is_active = true
  AND event_batch.timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour';

-- Batch processing using specialized batch pool
SELECT @batch_pool := USE_CONNECTION_POOL('batch_pool');

-- Process large dataset with batch operations  
WITH batch_processing AS (
  UPDATE user_statistics 
  SET 
    monthly_page_views = monthly_stats.page_views,
    monthly_session_time = monthly_stats.session_time,
    monthly_orders = monthly_stats.orders,
    monthly_revenue = monthly_stats.revenue,
    last_calculated = CURRENT_TIMESTAMP,
    calculation_version = calculation_version + 1
  FROM (
    SELECT 
      u.user_id,
      COUNT(a.activity_id) as page_views,
      SUM(a.session_duration) as session_time,
      COUNT(DISTINCT o.order_id) as orders,
      SUM(o.order_total) as revenue
    FROM users u
    LEFT JOIN user_activities a ON u.user_id = a.user_id 
      AND a.activity_date >= DATE_TRUNC('month', CURRENT_DATE)
    LEFT JOIN orders o ON u.user_id = o.customer_id 
      AND o.order_date >= DATE_TRUNC('month', CURRENT_DATE)
      AND o.status = 'completed'
    WHERE u.status = 'active'
    GROUP BY u.user_id
  ) AS monthly_stats
  WHERE user_statistics.user_id = monthly_stats.user_id
  RETURNING user_statistics.user_id, user_statistics.monthly_revenue
)
SELECT 
  'batch_update_completed' as operation_type,
  COUNT(*) as users_updated,
  SUM(monthly_revenue) as total_monthly_revenue,
  AVG(monthly_revenue) as avg_monthly_revenue,
  MIN(monthly_revenue) as min_monthly_revenue,
  MAX(monthly_revenue) as max_monthly_revenue,
  CURRENT_TIMESTAMP as completed_at
FROM batch_processing;

-- Connection pool performance analysis and optimization
WITH pool_performance_analysis AS (
  SELECT 
    pool_name,
    current_connections,
    max_connections,
    active_connections,
    available_connections,
    total_operations,
    total_errors,
    avg_checkout_time_ms,
    error_rate_pct,
    pool_utilization_pct,

    -- Performance indicators
    CASE 
      WHEN pool_utilization_pct > 90 THEN 'HIGH_UTILIZATION'
      WHEN pool_utilization_pct < 20 THEN 'UNDERUTILIZED'  
      ELSE 'OPTIMAL_UTILIZATION'
    END as utilization_status,

    CASE
      WHEN error_rate_pct > 5 THEN 'HIGH_ERROR_RATE'
      WHEN error_rate_pct > 1 THEN 'ELEVATED_ERROR_RATE'
      ELSE 'NORMAL_ERROR_RATE'  
    END as error_status,

    CASE 
      WHEN avg_checkout_time_ms > 1000 THEN 'SLOW_CHECKOUT'
      WHEN avg_checkout_time_ms > 500 THEN 'MODERATE_CHECKOUT'
      ELSE 'FAST_CHECKOUT'
    END as checkout_performance,

    -- Optimization recommendations
    CASE 
      WHEN pool_utilization_pct > 90 AND error_rate_pct > 2 THEN 'INCREASE_POOL_SIZE'
      WHEN pool_utilization_pct < 20 AND total_operations < 100 THEN 'DECREASE_POOL_SIZE'
      WHEN avg_checkout_time_ms > 1000 THEN 'CHECK_CONNECTION_HEALTH'
      WHEN error_rate_pct > 5 THEN 'INVESTIGATE_CONNECTION_ERRORS'
      ELSE 'POOL_OPTIMALLY_CONFIGURED'
    END as optimization_recommendation

  FROM (
    SELECT 
      CONNECTION_POOL_NAME() as pool_name,
      CONNECTION_POOL_CURRENT_SIZE() as current_connections,
      CONNECTION_POOL_MAX_SIZE() as max_connections,
      CONNECTION_POOL_ACTIVE() as active_connections,
      CONNECTION_POOL_AVAILABLE() as available_connections,
      CONNECTION_POOL_OPERATION_COUNT() as total_operations,
      CONNECTION_POOL_ERROR_COUNT() as total_errors,
      CONNECTION_POOL_AVG_CHECKOUT_TIME() as avg_checkout_time_ms,
      ROUND((CONNECTION_POOL_ERROR_COUNT()::float / NULLIF(CONNECTION_POOL_OPERATION_COUNT(), 0)) * 100, 2) as error_rate_pct,
      ROUND((CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE()) * 100, 2) as pool_utilization_pct
    FROM CONNECTION_POOLS()
  ) pool_metrics
)
SELECT 
  pool_name,
  current_connections || '/' || max_connections as pool_size,
  pool_utilization_pct || '%' as utilization,
  active_connections || ' active' as activity,
  available_connections || ' available' as availability,
  total_operations || ' ops' as operations,
  error_rate_pct || '%' as error_rate,
  avg_checkout_time_ms || 'ms' as checkout_time,

  -- Status indicators
  utilization_status,
  error_status, 
  checkout_performance,
  optimization_recommendation,

  -- Priority scoring for optimization efforts
  CASE 
    WHEN optimization_recommendation = 'INCREASE_POOL_SIZE' THEN 1
    WHEN optimization_recommendation = 'INVESTIGATE_CONNECTION_ERRORS' THEN 2
    WHEN optimization_recommendation = 'CHECK_CONNECTION_HEALTH' THEN 3
    WHEN optimization_recommendation = 'DECREASE_POOL_SIZE' THEN 4
    ELSE 5
  END as optimization_priority

FROM pool_performance_analysis
ORDER BY optimization_priority, error_rate_pct DESC, pool_utilization_pct DESC;

-- Real-time connection pool monitoring and alerting
SELECT 
  pool_name,
  health_status,
  current_connections,
  pool_utilization_pct,
  error_rate_pct,
  avg_checkout_time_ms,

  -- Generate alerts based on thresholds
  CASE 
    WHEN error_rate_pct > 10 THEN 
      'CRITICAL: High error rate (' || error_rate_pct || '%) - immediate investigation required'
    WHEN pool_utilization_pct > 95 THEN 
      'CRITICAL: Pool exhaustion (' || pool_utilization_pct || '%) - increase pool size immediately'
    WHEN avg_checkout_time_ms > 2000 THEN 
      'WARNING: Slow connection checkout (' || avg_checkout_time_ms || 'ms) - check connection health'
    WHEN error_rate_pct > 5 THEN 
      'WARNING: Elevated error rate (' || error_rate_pct || '%) - monitor closely'
    WHEN pool_utilization_pct > 85 THEN 
      'WARNING: High pool utilization (' || pool_utilization_pct || '%) - consider scaling'
    ELSE 'INFO: Pool operating normally'
  END as alert_message,

  CASE 
    WHEN error_rate_pct > 10 OR pool_utilization_pct > 95 THEN 'CRITICAL'
    WHEN error_rate_pct > 5 OR pool_utilization_pct > 85 OR avg_checkout_time_ms > 2000 THEN 'WARNING'
    ELSE 'INFO'
  END as alert_severity,

  CURRENT_TIMESTAMP as alert_timestamp

FROM (
  SELECT 
    CONNECTION_POOL_NAME() as pool_name,
    CASE 
      WHEN CONNECTION_POOL_ERROR_COUNT()::float / NULLIF(CONNECTION_POOL_OPERATION_COUNT(), 0) > 0.1 THEN 'UNHEALTHY'
      WHEN CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE() > 0.9 THEN 'WARNING'
      WHEN CONNECTION_POOL_AVG_CHECKOUT_TIME() > 1000 THEN 'WARNING'
      ELSE 'HEALTHY'
    END as health_status,
    CONNECTION_POOL_CURRENT_SIZE() as current_connections,
    ROUND((CONNECTION_POOL_CURRENT_SIZE()::float / CONNECTION_POOL_MAX_SIZE()) * 100, 2) as pool_utilization_pct,
    ROUND((CONNECTION_POOL_ERROR_COUNT()::float / NULLIF(CONNECTION_POOL_OPERATION_COUNT(), 0)) * 100, 2) as error_rate_pct,
    CONNECTION_POOL_AVG_CHECKOUT_TIME() as avg_checkout_time_ms
  FROM CONNECTION_POOLS()
) pool_health_check
WHERE alert_severity IN ('CRITICAL', 'WARNING')  -- Only show alerts that need attention
ORDER BY 
  CASE alert_severity 
    WHEN 'CRITICAL' THEN 1
    WHEN 'WARNING' THEN 2
    ELSE 3
  END,
  error_rate_pct DESC,
  pool_utilization_pct DESC;

-- QueryLeaf provides comprehensive connection pool management:
-- 1. SQL-familiar connection pool configuration and creation
-- 2. Automatic connection lifecycle management and optimization  
-- 3. Built-in performance monitoring and health assessment
-- 4. Specialized pools for different workload patterns (read/write/batch)
-- 5. Real-time alerting and anomaly detection
-- 6. Load balancing and failover handling
-- 7. Resource utilization optimization and auto-scaling recommendations
-- 8. Integration with MongoDB driver performance features  
-- 9. Connection pool statistics and performance analytics
-- 10. Production-ready error handling and recovery mechanisms

Best Practices for Connection Pool Optimization

Design Guidelines

Essential practices for optimal connection pool configuration:

Pool Sizing Strategy: Size pools based on application concurrency patterns and database server capacity
Workload Separation: Use separate pools for read/write operations to optimize for different performance characteristics
Health Monitoring: Implement comprehensive monitoring and alerting for pool health and performance
Timeout Configuration: Set appropriate timeouts for connection establishment, operations, and idle connections
Error Handling: Implement robust error handling with automatic retry and recovery mechanisms
Resource Management: Monitor resource utilization and implement auto-scaling strategies

Performance Optimization

Optimize connection pools for maximum throughput and efficiency:

Connection Reuse: Maximize connection reuse through appropriate idle timeout configuration
Load Balancing: Distribute load across replica set members using read preferences
Compression: Enable compression for improved network efficiency and reduced bandwidth usage
Batch Operations: Use specialized batch processing pools for high-volume data operations
Resource Pooling: Pool not just connections but also prepared statements and query plans
Performance Monitoring: Continuously monitor and optimize based on real-world usage patterns

Conclusion

MongoDB connection pooling provides essential infrastructure for scalable, high-performance database applications. By implementing sophisticated connection management with automatic lifecycle handling, load balancing, and performance optimization, connection pools eliminate the overhead and complexity of per-request connection management while delivering predictable performance characteristics.

Key connection pooling benefits include:

Resource Efficiency: Optimal utilization of database connections and system resources
Predictable Performance: Consistent response times regardless of concurrent load
Automatic Management: Built-in connection lifecycle, health monitoring, and recovery
High Availability: Automatic failover and retry mechanisms for robust error handling
Scalable Architecture: Support for various deployment patterns from single-instance to globally distributed

Whether you're building high-traffic web applications, batch processing systems, multi-tenant SaaS platforms, or globally distributed services, MongoDB connection pooling with QueryLeaf's familiar SQL interface provides the foundation for robust database connectivity. This combination enables you to implement sophisticated connection management strategies while preserving familiar development patterns and operational approaches.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB connection pool configuration, monitoring, and optimization while providing SQL-familiar connection management syntax. Complex pooling strategies, performance monitoring, and resource optimization are seamlessly handled through familiar SQL patterns, making high-performance database connectivity both powerful and accessible.

The integration of advanced connection pooling with SQL-style database operations makes MongoDB an ideal platform for applications requiring both high-performance database connectivity and familiar interaction patterns, ensuring your database infrastructure remains both efficient and maintainable as it scales and evolves.

September 12, 2025
24 min read

MongoDB Aggregation Pipeline Optimization: SQL-Style Performance Tuning for Complex Data Analytics

Modern applications generate vast amounts of data requiring complex analytical processing - real-time reporting, business intelligence, data transformation, and advanced analytics. Traditional SQL databases handle complex queries through sophisticated query planners and optimization engines, but often struggle with unstructured data and horizontal scaling requirements.

MongoDB's aggregation pipeline provides powerful data processing capabilities that can handle complex analytics workloads at scale, but requires careful optimization to achieve optimal performance. Unlike traditional SQL query optimization that relies heavily on automatic query planning, MongoDB aggregation pipeline optimization requires understanding pipeline stage execution order, memory management, and strategic indexing approaches.

The Complex Analytics Performance Challenge

Traditional SQL analytics approaches face scalability and flexibility limitations:

-- Traditional SQL complex analytics - performance challenges at scale
WITH regional_sales AS (
  SELECT 
    r.region_name,
    p.category,
    p.subcategory,
    DATE_TRUNC('month', o.order_date) as month,
    SUM(oi.quantity * oi.unit_price) as gross_revenue,
    SUM(oi.quantity * p.cost_basis) as cost_of_goods,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    COUNT(o.order_id) as total_orders
  FROM orders o
  JOIN order_items oi ON o.order_id = oi.order_id
  JOIN products p ON oi.product_id = p.product_id
  JOIN customers c ON o.customer_id = c.customer_id
  JOIN regions r ON c.region_id = r.region_id
  WHERE o.order_date >= '2024-01-01'
    AND o.status IN ('completed', 'shipped')
  GROUP BY r.region_name, p.category, p.subcategory, DATE_TRUNC('month', o.order_date)
),
monthly_trends AS (
  SELECT 
    region_name,
    category,
    month,
    SUM(gross_revenue) as monthly_revenue,
    SUM(cost_of_goods) as monthly_costs,
    (SUM(gross_revenue) - SUM(cost_of_goods)) as monthly_profit,
    SUM(unique_customers) as monthly_customers,
    SUM(total_orders) as monthly_orders,

    -- Window functions for trend analysis
    LAG(SUM(gross_revenue), 1) OVER (
      PARTITION BY region_name, category 
      ORDER BY month
    ) as previous_month_revenue,

    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SUM(gross_revenue)) OVER (
      PARTITION BY region_name, category
    ) as median_monthly_revenue
  FROM regional_sales
  GROUP BY region_name, category, month
)
SELECT 
  region_name,
  category,
  month,
  monthly_revenue,
  monthly_profit,
  monthly_customers,

  -- Growth calculations
  ROUND(
    ((monthly_revenue - previous_month_revenue) / previous_month_revenue) * 100, 2
  ) as revenue_growth_percent,

  -- Performance vs median
  ROUND(
    (monthly_revenue / median_monthly_revenue) * 100, 2
  ) as performance_vs_median,

  -- Customer metrics
  ROUND(monthly_revenue / monthly_customers, 2) as revenue_per_customer,
  ROUND(monthly_orders / monthly_customers, 2) as orders_per_customer

FROM monthly_trends
WHERE month >= '2024-06-01'
ORDER BY region_name, category, month;

-- Problems with traditional approaches:
-- - Complex joins across multiple large tables
-- - Window functions require full data scanning
-- - Memory intensive for large datasets
-- - Limited horizontal scaling capabilities
-- - Rigid schema requirements
-- - Poor performance with nested/dynamic data structures
-- - Difficult to optimize for distributed processing

MongoDB aggregation pipelines provide optimized analytics processing:

// MongoDB optimized aggregation pipeline - high performance analytics
db.orders.aggregate([
  // Stage 1: Early filtering with index support
  {
    $match: {
      orderDate: { $gte: ISODate('2024-01-01') },
      status: { $in: ['completed', 'shipped'] }
    }
  },

  // Stage 2: Efficient lookup with optimized joins
  {
    $lookup: {
      from: 'customers',
      localField: 'customerId',
      foreignField: '_id',
      as: 'customer',
      pipeline: [
        { $project: { regionId: 1, _id: 0 } } // Project only needed fields
      ]
    }
  },

  // Stage 3: Unwind with preserveNullAndEmptyArrays for performance
  { $unwind: '$customer' },
  { $unwind: '$items' },

  // Stage 4: Second lookup for product data
  {
    $lookup: {
      from: 'products',
      localField: 'items.productId',
      foreignField: '_id',
      as: 'product',
      pipeline: [
        { $project: { category: 1, subcategory: 1, costBasis: 1, _id: 0 } }
      ]
    }
  },

  { $unwind: '$product' },

  // Stage 5: Third lookup for region data
  {
    $lookup: {
      from: 'regions',
      localField: 'customer.regionId',
      foreignField: '_id',
      as: 'region',
      pipeline: [
        { $project: { regionName: 1, _id: 0 } }
      ]
    }
  },

  { $unwind: '$region' },

  // Stage 6: Add computed fields efficiently
  {
    $addFields: {
      month: { 
        $dateFromParts: {
          year: { $year: '$orderDate' },
          month: { $month: '$orderDate' },
          day: 1
        }
      },
      itemRevenue: { $multiply: ['$items.quantity', '$items.unitPrice'] },
      itemCost: { $multiply: ['$items.quantity', '$product.costBasis'] }
    }
  },

  // Stage 7: Group for initial aggregation
  {
    $group: {
      _id: {
        region: '$region.regionName',
        category: '$product.category',
        subcategory: '$product.subcategory',
        month: '$month'
      },
      grossRevenue: { $sum: '$itemRevenue' },
      costOfGoods: { $sum: '$itemCost' },
      uniqueCustomers: { $addToSet: '$customerId' },
      totalOrders: { $sum: 1 }
    }
  },

  // Stage 8: Transform unique customers to count
  {
    $addFields: {
      uniqueCustomerCount: { $size: '$uniqueCustomers' }
    }
  },

  // Stage 9: Project final structure
  {
    $project: {
      region: '$_id.region',
      category: '$_id.category',
      subcategory: '$_id.subcategory',
      month: '$_id.month',
      grossRevenue: 1,
      costOfGoods: 1,
      profit: { $subtract: ['$grossRevenue', '$costOfGoods'] },
      uniqueCustomerCount: 1,
      totalOrders: 1,
      revenuePerCustomer: {
        $round: [
          { $divide: ['$grossRevenue', '$uniqueCustomerCount'] },
          2
        ]
      },
      ordersPerCustomer: {
        $round: [
          { $divide: ['$totalOrders', '$uniqueCustomerCount'] },
          2
        ]
      },
      _id: 0
    }
  },

  // Stage 10: Sort for consistent output
  {
    $sort: {
      region: 1,
      category: 1,
      month: 1
    }
  },

  // Stage 11: Add window functions for trend analysis
  {
    $setWindowFields: {
      partitionBy: { region: '$region', category: '$category' },
      sortBy: { month: 1 },
      output: {
        previousMonthRevenue: {
          $shift: {
            output: '$grossRevenue',
            by: -1
          }
        },
        medianMonthlyRevenue: {
          $median: '$grossRevenue',
          window: {
            documents: ['unbounded preceding', 'unbounded following']
          }
        },
        revenueGrowthPercent: {
          $round: [
            {
              $multiply: [
                {
                  $divide: [
                    { $subtract: ['$grossRevenue', '$previousMonthRevenue'] },
                    '$previousMonthRevenue'
                  ]
                },
                100
              ]
            },
            2
          ]
        }
      }
    }
  },

  // Stage 12: Final filtering for recent months
  {
    $match: {
      month: { $gte: ISODate('2024-06-01') }
    }
  }
], {
  // Pipeline options for optimization
  allowDiskUse: true,        // Allow spilling to disk for large datasets
  maxTimeMS: 300000,         // 5 minute timeout
  hint: { orderDate: 1, status: 1 }, // Suggest index usage
  readConcern: { level: 'majority' }  // Consistency level
});

// Benefits of optimized aggregation pipelines:
// - Early filtering reduces data volume through pipeline
// - Efficient $lookup stages with projected fields
// - Strategic index utilization
// - Memory-efficient processing with disk spilling
// - Native support for complex analytical operations
// - Horizontal scaling across shards
// - Flexible handling of nested/dynamic data
// - Built-in window functions for trend analysis

Understanding MongoDB Aggregation Pipeline Performance

Pipeline Stage Optimization and Ordering

Implement strategic pipeline stage ordering for optimal performance:

// Advanced aggregation pipeline optimization patterns
class AggregationOptimizer {
  constructor(db) {
    this.db = db;
    this.performanceMetrics = new Map();
    this.indexRecommendations = [];
  }

  async optimizeEarlyFiltering(collection, pipeline) {
    // Move filtering stages as early as possible
    const optimizedPipeline = [];
    const filterStages = [];
    const nonFilterStages = [];

    // Separate filter stages from other stages
    pipeline.forEach(stage => {
      const stageType = Object.keys(stage)[0];
      if (stageType === '$match' || stageType === '$limit') {
        filterStages.push(stage);
      } else {
        nonFilterStages.push(stage);
      }
    });

    // Early filtering reduces document flow through pipeline
    optimizedPipeline.push(...filterStages);
    optimizedPipeline.push(...nonFilterStages);

    return optimizedPipeline;
  }

  async createProjectionOptimizedPipeline(baseCollection, lookupCollections, projections) {
    // Optimize projections and lookups for minimal data transfer
    return [
      // Stage 1: Early projection to reduce document size
      {
        $project: {
          // Only include fields needed for subsequent stages
          ...projections.baseFields,
          // Include fields needed for lookups
          ...projections.lookupKeys
        }
      },

      // Stage 2: Optimized lookups with sub-pipelines
      ...lookupCollections.map(lookup => ({
        $lookup: {
          from: lookup.collection,
          localField: lookup.localField,
          foreignField: lookup.foreignField,
          as: lookup.as,
          pipeline: [
            // Project only needed fields in lookup
            { $project: lookup.projection },
            // Add filters within lookup when possible
            ...(lookup.filters ? [{ $match: lookup.filters }] : [])
          ]
        }
      })),

      // Stage 3: Unwind with null preservation for performance
      ...lookupCollections.map(lookup => ({
        $unwind: {
          path: `$${lookup.as}`,
          preserveNullAndEmptyArrays: lookup.preserveNulls || false
        }
      })),

      // Stage 4: Final projection after all joins
      {
        $project: projections.finalFields
      }
    ];
  }

  async analyzeIndexUsage(collection, pipeline) {
    // Analyze pipeline for index optimization opportunities
    const explanation = await this.db.collection(collection).aggregate(
      pipeline,
      { explain: true }
    ).toArray();

    const indexAnalysis = {
      stagesAnalyzed: [],
      indexesUsed: [],
      indexesRecommended: [],
      performanceIssues: []
    };

    // Analyze each stage for index usage
    explanation.forEach((stage, index) => {
      const stageType = Object.keys(pipeline[index])[0];
      const stageAnalysis = {
        stage: index,
        type: stageType,
        indexUsed: false,
        collectionScanned: false,
        documentsExamined: 0,
        documentsReturned: 0
      };

      if (stage.executionStats) {
        stageAnalysis.indexUsed = stage.executionStats.executionTimeMillisEstimate < 100;
        stageAnalysis.documentsExamined = stage.executionStats.totalDocsExamined;
        stageAnalysis.documentsReturned = stage.executionStats.totalDocsReturned;

        // Identify inefficient stages
        if (stageAnalysis.documentsExamined > stageAnalysis.documentsReturned * 10) {
          indexAnalysis.performanceIssues.push({
            stage: index,
            issue: 'high_document_examination_ratio',
            ratio: stageAnalysis.documentsExamined / stageAnalysis.documentsReturned,
            recommendation: 'Consider adding index for this stage'
          });
        }
      }

      indexAnalysis.stagesAnalyzed.push(stageAnalysis);
    });

    return indexAnalysis;
  }

  async createPerformanceOptimizedPipeline(collection, analyticsQuery) {
    // Create comprehensive performance-optimized pipeline
    const pipeline = [
      // Stage 1: Efficient date range filtering with index
      {
        $match: {
          [analyticsQuery.dateField]: {
            $gte: analyticsQuery.startDate,
            $lte: analyticsQuery.endDate
          },
          // Add compound index filters
          ...analyticsQuery.filters
        }
      },

      // Stage 2: Early sampling for large datasets (if needed)
      ...(analyticsQuery.sampleSize ? [{
        $sample: { size: analyticsQuery.sampleSize }
      }] : []),

      // Stage 3: Efficient faceted search
      {
        $facet: {
          // Main aggregation pipeline
          data: [
            // Lookup with optimized sub-pipeline
            {
              $lookup: {
                from: analyticsQuery.lookupCollection,
                localField: analyticsQuery.localField,
                foreignField: analyticsQuery.foreignField,
                as: 'lookupData',
                pipeline: [
                  { $project: analyticsQuery.lookupProjection },
                  { $limit: 1 } // Limit lookup results when appropriate
                ]
              }
            },

            { $unwind: '$lookupData' },

            // Grouping with efficient accumulators
            {
              $group: {
                _id: analyticsQuery.groupBy,

                // Use $sum for counting instead of $addToSet when possible
                totalCount: { $sum: 1 },
                totalValue: { $sum: analyticsQuery.valueField },
                averageValue: { $avg: analyticsQuery.valueField },

                // Efficient min/max calculations
                minValue: { $min: analyticsQuery.valueField },
                maxValue: { $max: analyticsQuery.valueField },

                // Use $push only when needed for arrays
                ...(analyticsQuery.collectArrays ? {
                  samples: { $push: analyticsQuery.sampleField }
                } : {})
              }
            },

            // Add calculated fields
            {
              $addFields: {
                efficiency: {
                  $round: [
                    { $divide: ['$totalValue', '$totalCount'] },
                    2
                  ]
                },
                valueRange: { $subtract: ['$maxValue', '$minValue'] }
              }
            },

            // Sort for consistent results
            { $sort: { totalValue: -1 } },

            // Limit results to prevent memory issues
            { $limit: analyticsQuery.maxResults || 1000 }
          ],

          // Metadata pipeline for counts and statistics
          metadata: [
            {
              $group: {
                _id: null,
                totalDocuments: { $sum: 1 },
                totalValue: { $sum: analyticsQuery.valueField },
                avgValue: { $avg: analyticsQuery.valueField }
              }
            }
          ]
        }
      },

      // Stage 4: Combine faceted results
      {
        $project: {
          data: 1,
          metadata: { $arrayElemAt: ['$metadata', 0] },
          processingTimestamp: new Date()
        }
      }
    ];

    return pipeline;
  }

  async benchmarkPipeline(collection, pipeline, options = {}) {
    // Comprehensive pipeline performance benchmarking
    const benchmarkResults = {
      pipelineName: options.name || 'unnamed_pipeline',
      startTime: new Date(),
      stages: [],
      totalExecutionTime: 0,
      documentsProcessed: 0,
      memoryUsage: 0,
      indexesUsed: [],
      recommendations: []
    };

    try {
      // Get execution statistics
      const startTime = Date.now();
      const explanation = await this.db.collection(collection).aggregate(
        pipeline,
        { 
          explain: true,
          allowDiskUse: true,
          ...options
        }
      ).toArray();

      // Analyze execution plan
      explanation.forEach((stageExplan, index) => {
        const stageBenchmark = {
          stageIndex: index,
          stageType: Object.keys(pipeline[index])[0],
          executionTimeMs: stageExplan.executionStats?.executionTimeMillisEstimate || 0,
          documentsIn: stageExplan.executionStats?.totalDocsExamined || 0,
          documentsOut: stageExplan.executionStats?.totalDocsReturned || 0,
          indexUsed: stageExplan.executionStats?.inputStage?.stage === 'IXSCAN',
          memoryUsageBytes: stageExplan.executionStats?.memUsage || 0
        };

        benchmarkResults.stages.push(stageBenchmark);
        benchmarkResults.totalExecutionTime += stageBenchmark.executionTimeMs;
        benchmarkResults.memoryUsage += stageBenchmark.memoryUsageBytes;
      });

      // Run actual pipeline for real-world timing
      const realStartTime = Date.now();
      const results = await this.db.collection(collection).aggregate(
        pipeline,
        { allowDiskUse: true, ...options }
      ).toArray();

      const realExecutionTime = Date.now() - realStartTime;
      benchmarkResults.realExecutionTime = realExecutionTime;
      benchmarkResults.documentsProcessed = results.length;

      // Generate recommendations
      benchmarkResults.recommendations = this.generateOptimizationRecommendations(
        benchmarkResults
      );

    } catch (error) {
      benchmarkResults.error = error.message;
    } finally {
      benchmarkResults.endTime = new Date();
    }

    // Store benchmark results for comparison
    this.performanceMetrics.set(benchmarkResults.pipelineName, benchmarkResults);

    return benchmarkResults;
  }

  generateOptimizationRecommendations(benchmarkResults) {
    const recommendations = [];

    // Check for stages without index usage
    benchmarkResults.stages.forEach((stage, index) => {
      if (!stage.indexUsed && stage.documentsIn > 1000) {
        recommendations.push({
          type: 'index_recommendation',
          stage: index,
          message: `Consider adding index for stage ${index} (${stage.stageType})`,
          priority: 'high',
          potentialImprovement: 'significant'
        });
      }

      if (stage.documentsIn > stage.documentsOut * 100) {
        recommendations.push({
          type: 'filtering_recommendation',
          stage: index,
          message: `Move filtering earlier in pipeline for stage ${index}`,
          priority: 'medium',
          potentialImprovement: 'moderate'
        });
      }
    });

    // Memory usage recommendations
    if (benchmarkResults.memoryUsage > 100 * 1024 * 1024) { // 100MB
      recommendations.push({
        type: 'memory_optimization',
        message: 'High memory usage detected - consider using allowDiskUse: true',
        priority: 'medium',
        potentialImprovement: 'prevents memory errors'
      });
    }

    // Execution time recommendations
    if (benchmarkResults.totalExecutionTime > 30000) { // 30 seconds
      recommendations.push({
        type: 'performance_optimization',
        message: 'Long execution time - review pipeline optimization opportunities',
        priority: 'high',
        potentialImprovement: 'significant'
      });
    }

    return recommendations;
  }

  async createIndexRecommendations(collection, commonPipelines) {
    // Generate index recommendations based on common pipeline patterns
    const recommendations = [];

    for (const pipeline of commonPipelines) {
      const analysis = await this.analyzeIndexUsage(collection, pipeline.stages);

      pipeline.stages.forEach((stage, index) => {
        const stageType = Object.keys(stage)[0];

        switch (stageType) {
          case '$match':
            const matchFields = Object.keys(stage.$match);
            if (matchFields.length > 0) {
              recommendations.push({
                type: 'compound_index',
                collection: collection,
                fields: matchFields,
                reason: `Optimize $match stage ${index}`,
                estimatedImprovement: 'high'
              });
            }
            break;

          case '$sort':
            const sortFields = Object.keys(stage.$sort);
            recommendations.push({
              type: 'sort_index',
              collection: collection,
              fields: sortFields,
              reason: `Optimize $sort stage ${index}`,
              estimatedImprovement: 'high'
            });
            break;

          case '$group':
            const groupField = stage.$group._id;
            if (typeof groupField === 'string' && groupField.startsWith('$')) {
              recommendations.push({
                type: 'grouping_index',
                collection: collection,
                fields: [groupField.substring(1)],
                reason: `Optimize $group stage ${index}`,
                estimatedImprovement: 'medium'
              });
            }
            break;
        }
      });
    }

    // Deduplicate and prioritize recommendations
    return this.prioritizeIndexRecommendations(recommendations);
  }

  prioritizeIndexRecommendations(recommendations) {
    // Remove duplicates and prioritize by impact
    const uniqueRecommendations = new Map();

    recommendations.forEach(rec => {
      const key = `${rec.collection}_${rec.fields.join('_')}`;
      const existing = uniqueRecommendations.get(key);

      if (!existing || this.getImpactScore(rec) > this.getImpactScore(existing)) {
        uniqueRecommendations.set(key, rec);
      }
    });

    return Array.from(uniqueRecommendations.values())
      .sort((a, b) => this.getImpactScore(b) - this.getImpactScore(a));
  }

  getImpactScore(recommendation) {
    const impactScores = {
      high: 3,
      medium: 2,
      low: 1
    };
    return impactScores[recommendation.estimatedImprovement] || 0;
  }

  async generatePerformanceReport() {
    // Generate comprehensive performance analysis report
    const report = {
      generatedAt: new Date(),
      totalPipelinesAnalyzed: this.performanceMetrics.size,
      performanceSummary: {
        fastPipelines: 0,      // < 1 second
        moderatePipelines: 0,  // 1-10 seconds
        slowPipelines: 0       // > 10 seconds
      },
      topPerformers: [],
      performanceIssues: [],
      indexRecommendations: [],
      overallRecommendations: []
    };

    // Analyze all benchmarked pipelines
    for (const [name, metrics] of this.performanceMetrics.entries()) {
      const executionTime = metrics.realExecutionTime || metrics.totalExecutionTime;

      if (executionTime < 1000) {
        report.performanceSummary.fastPipelines++;
      } else if (executionTime < 10000) {
        report.performanceSummary.moderatePipelines++;
      } else {
        report.performanceSummary.slowPipelines++;
      }

      // Identify top performers and issues
      if (executionTime < 500 && metrics.documentsProcessed > 1000) {
        report.topPerformers.push({
          name: name,
          executionTime: executionTime,
          documentsProcessed: metrics.documentsProcessed,
          efficiency: metrics.documentsProcessed / executionTime
        });
      }

      if (executionTime > 30000 || metrics.memoryUsage > 500 * 1024 * 1024) {
        report.performanceIssues.push({
          name: name,
          executionTime: executionTime,
          memoryUsage: metrics.memoryUsage,
          recommendations: metrics.recommendations
        });
      }
    }

    // Sort top performers by efficiency
    report.topPerformers.sort((a, b) => b.efficiency - a.efficiency);

    // Generate overall recommendations
    if (report.performanceSummary.slowPipelines > 0) {
      report.overallRecommendations.push(
        'Multiple slow pipelines detected - prioritize optimization efforts'
      );
    }

    if (this.indexRecommendations.length > 5) {
      report.overallRecommendations.push(
        'Consider implementing recommended indexes to improve query performance'
      );
    }

    return report;
  }
}

Memory Management and Disk Spilling

Implement efficient memory management for large aggregations:

// Advanced memory management and optimization strategies
class AggregationMemoryManager {
  constructor(db) {
    this.db = db;
    this.memoryThresholds = {
      warning: 100 * 1024 * 1024,    // 100MB
      critical: 500 * 1024 * 1024,   // 500MB
      maximum: 1024 * 1024 * 1024    // 1GB
    };
  }

  async createMemoryEfficientPipeline(collection, aggregationConfig) {
    // Design pipeline with memory efficiency in mind
    const memoryOptimizedPipeline = [
      // Stage 1: Early filtering to reduce dataset size
      {
        $match: {
          ...aggregationConfig.filters,
          // Add indexed filters first
          [aggregationConfig.dateField]: {
            $gte: aggregationConfig.startDate,
            $lte: aggregationConfig.endDate
          }
        }
      },

      // Stage 2: Project only necessary fields early
      {
        $project: {
          // Include only fields needed for processing
          ...aggregationConfig.requiredFields,
          // Exclude large text fields unless necessary
          ...(aggregationConfig.excludeFields.reduce((acc, field) => {
            acc[field] = 0;
            return acc;
          }, {}))
        }
      },

      // Stage 3: Use streaming-friendly operations
      {
        $group: {
          _id: aggregationConfig.groupBy,

          // Use memory-efficient accumulators
          count: { $sum: 1 },
          totalValue: { $sum: aggregationConfig.valueField },

          // Avoid $addToSet for large arrays - use $mergeObjects for smaller sets
          ...(aggregationConfig.collectSets && aggregationConfig.expectedSetSize < 1000 ? {
            uniqueValues: { $addToSet: aggregationConfig.setField }
          } : {}),

          // Use $first/$last instead of $push for single values
          firstValue: { $first: aggregationConfig.valueField },
          lastValue: { $last: aggregationConfig.valueField },

          // Calculated fields at group level to avoid later processing
          averageValue: { $avg: aggregationConfig.valueField }
        }
      },

      // Stage 4: Add computed fields efficiently
      {
        $addFields: {
          efficiency: {
            $cond: {
              if: { $gt: ['$count', 0] },
              then: { $divide: ['$totalValue', '$count'] },
              else: 0
            }
          },

          // Avoid complex calculations on large arrays
          setSize: {
            $cond: {
              if: { $isArray: '$uniqueValues' },
              then: { $size: '$uniqueValues' },
              else: 0
            }
          }
        }
      },

      // Stage 5: Sort with limit to prevent large result sets
      { $sort: { totalValue: -1 } },
      { $limit: aggregationConfig.maxResults || 10000 },

      // Stage 6: Final projection to minimize output size
      {
        $project: {
          groupKey: '$_id',
          metrics: {
            count: '$count',
            totalValue: '$totalValue',
            averageValue: '$averageValue',
            efficiency: '$efficiency'
          },
          _id: 0
        }
      }
    ];

    return memoryOptimizedPipeline;
  }

  async processLargeDatasetWithBatching(collection, pipeline, batchConfig) {
    // Process large datasets in batches to manage memory
    const results = [];
    const batchSize = batchConfig.batchSize || 10000;
    const totalBatches = Math.ceil(batchConfig.totalDocuments / batchSize);

    console.log(`Processing ${batchConfig.totalDocuments} documents in ${totalBatches} batches`);

    for (let batch = 0; batch < totalBatches; batch++) {
      const skip = batch * batchSize;

      const batchPipeline = [
        // Add skip and limit for batching
        { $skip: skip },
        { $limit: batchSize },

        // Original pipeline stages
        ...pipeline
      ];

      try {
        const batchResults = await this.db.collection(collection).aggregate(
          batchPipeline,
          {
            allowDiskUse: true,
            maxTimeMS: 60000, // 1 minute per batch
            readConcern: { level: 'available' } // Use available for better performance
          }
        ).toArray();

        results.push(...batchResults);

        console.log(`Completed batch ${batch + 1}/${totalBatches} (${batchResults.length} results)`);

        // Optional: Add delay between batches to reduce load
        if (batchConfig.delayMs && batch < totalBatches - 1) {
          await new Promise(resolve => setTimeout(resolve, batchConfig.delayMs));
        }

      } catch (error) {
        console.error(`Batch ${batch + 1} failed:`, error.message);

        // Optionally continue with remaining batches
        if (batchConfig.continueOnError) {
          continue;
        } else {
          throw error;
        }
      }
    }

    return results;
  }

  async createStreamingAggregation(collection, pipeline, outputHandler) {
    // Create streaming aggregation for real-time processing
    const cursor = this.db.collection(collection).aggregate(pipeline, {
      allowDiskUse: true,
      batchSize: 1000, // Small batch size for streaming
      readConcern: { level: 'available' }
    });

    const streamingStats = {
      documentsProcessed: 0,
      startTime: new Date(),
      memoryPeakUsage: 0,
      batchesProcessed: 0
    };

    try {
      while (await cursor.hasNext()) {
        const document = await cursor.next();

        // Process document through handler
        await outputHandler(document, streamingStats);

        streamingStats.documentsProcessed++;

        // Monitor memory usage (approximate)
        if (streamingStats.documentsProcessed % 1000 === 0) {
          const memoryUsage = process.memoryUsage();
          streamingStats.memoryPeakUsage = Math.max(
            streamingStats.memoryPeakUsage,
            memoryUsage.heapUsed
          );

          console.log(`Processed ${streamingStats.documentsProcessed} documents, Memory: ${Math.round(memoryUsage.heapUsed / 1024 / 1024)}MB`);
        }
      }

    } finally {
      await cursor.close();
      streamingStats.endTime = new Date();
      streamingStats.totalProcessingTime = streamingStats.endTime - streamingStats.startTime;
    }

    return streamingStats;
  }

  async optimizePipelineForLargeArrays(collection, pipeline, arrayOptimizations) {
    // Optimize pipelines that work with large arrays
    const optimizedPipeline = [];

    pipeline.forEach((stage, index) => {
      const stageType = Object.keys(stage)[0];

      switch (stageType) {
        case '$unwind':
          // Add preserveNullAndEmptyArrays and includeArrayIndex for efficiency
          optimizedPipeline.push({
            $unwind: {
              path: stage.$unwind.path || stage.$unwind,
              preserveNullAndEmptyArrays: true,
              includeArrayIndex: `${stage.$unwind.path || stage.$unwind}_index`
            }
          });
          break;

        case '$group':
          // Optimize group operations for array handling
          const groupStage = { ...stage };

          // Replace $addToSet with $mergeObjects for better performance when possible
          Object.keys(groupStage.$group).forEach(key => {
            if (key !== '_id') {
              const accumulator = groupStage.$group[key];

              if (accumulator.$addToSet && arrayOptimizations.convertAddToSetToMerge) {
                // Convert to more efficient operation when possible
                groupStage.$group[key] = { $push: accumulator.$addToSet };
              }
            }
          });

          optimizedPipeline.push(groupStage);
          break;

        case '$project':
          // Optimize array operations in projection
          const projectStage = { ...stage };

          Object.keys(projectStage.$project).forEach(key => {
            const projection = projectStage.$project[key];

            // Replace array operations with more efficient alternatives
            if (projection && typeof projection === 'object' && projection.$size) {
              // $size can be expensive on very large arrays
              if (arrayOptimizations.approximateArraySizes) {
                projectStage.$project[`${key}_approx`] = {
                  $cond: {
                    if: { $isArray: projection.$size },
                    then: { $min: [{ $size: projection.$size }, 10000] }, // Cap at 10k
                    else: 0
                  }
                };
              }
            }
          });

          optimizedPipeline.push(projectStage);
          break;

        default:
          optimizedPipeline.push(stage);
      }
    });

    // Add array-specific optimizations
    if (arrayOptimizations.limitArrayProcessing) {
      // Add $limit stages after $unwind to prevent processing too many array elements
      optimizedPipeline.forEach((stage, index) => {
        if (stage.$unwind && index < optimizedPipeline.length - 1) {
          optimizedPipeline.splice(index + 1, 0, {
            $limit: arrayOptimizations.maxArrayElements || 100000
          });
        }
      });
    }

    return optimizedPipeline;
  }

  async monitorAggregationPerformance(collection, pipeline, options = {}) {
    // Comprehensive performance monitoring for aggregations
    const performanceMonitor = {
      startTime: new Date(),
      memorySnapshots: [],
      stageTimings: [],
      resourceUsage: {
        cpuStart: process.cpuUsage(),
        memoryStart: process.memoryUsage()
      }
    };

    // Function to take memory snapshots
    const takeMemorySnapshot = () => {
      const memoryUsage = process.memoryUsage();
      performanceMonitor.memorySnapshots.push({
        timestamp: new Date(),
        heapUsed: memoryUsage.heapUsed,
        heapTotal: memoryUsage.heapTotal,
        external: memoryUsage.external,
        rss: memoryUsage.rss
      });
    };

    // Take initial snapshot
    takeMemorySnapshot();

    try {
      let results;

      if (options.explain) {
        // Get execution plan with timing
        results = await this.db.collection(collection).aggregate(
          pipeline,
          { 
            explain: true,
            allowDiskUse: options.allowDiskUse || true,
            maxTimeMS: options.maxTimeMS || 300000
          }
        ).toArray();

        // Analyze execution plan
        results.forEach((stageExplan, index) => {
          performanceMonitor.stageTimings.push({
            stage: index,
            type: Object.keys(pipeline[index])[0],
            executionTimeMs: stageExplan.executionStats?.executionTimeMillisEstimate || 0,
            documentsIn: stageExplan.executionStats?.totalDocsExamined || 0,
            documentsOut: stageExplan.executionStats?.totalDocsReturned || 0
          });
        });

      } else {
        // Execute actual pipeline with monitoring
        const monitoringInterval = setInterval(takeMemorySnapshot, 5000); // Every 5 seconds

        try {
          results = await this.db.collection(collection).aggregate(
            pipeline,
            {
              allowDiskUse: options.allowDiskUse || true,
              maxTimeMS: options.maxTimeMS || 300000,
              batchSize: options.batchSize || 1000
            }
          ).toArray();

        } finally {
          clearInterval(monitoringInterval);
        }
      }

      // Take final snapshot
      takeMemorySnapshot();

      // Calculate performance metrics
      const endTime = new Date();
      const totalTime = endTime - performanceMonitor.startTime;
      const finalCpuUsage = process.cpuUsage(performanceMonitor.resourceUsage.cpuStart);
      const finalMemoryUsage = process.memoryUsage();

      performanceMonitor.summary = {
        totalExecutionTime: totalTime,
        documentsReturned: results.length,
        avgMemoryUsage: performanceMonitor.memorySnapshots.reduce(
          (sum, snapshot) => sum + snapshot.heapUsed, 0
        ) / performanceMonitor.memorySnapshots.length,
        peakMemoryUsage: Math.max(
          ...performanceMonitor.memorySnapshots.map(s => s.heapUsed)
        ),
        cpuUserTime: finalCpuUsage.user / 1000, // Convert to milliseconds
        cpuSystemTime: finalCpuUsage.system / 1000,
        memoryDifference: finalMemoryUsage.heapUsed - performanceMonitor.resourceUsage.memoryStart.heapUsed
      };

      return {
        results: results,
        performanceData: performanceMonitor
      };

    } catch (error) {
      performanceMonitor.error = error.message;
      throw error;
    }
  }

  async optimizeForShardedCollection(collection, pipeline, shardingConfig) {
    // Optimize pipeline for sharded collections
    const shardOptimizedPipeline = [];

    // Add shard key filtering early if possible
    if (shardingConfig.shardKey && shardingConfig.shardKeyValues) {
      shardOptimizedPipeline.push({
        $match: {
          [shardingConfig.shardKey]: {
            $in: shardingConfig.shardKeyValues
          }
        }
      });
    }

    pipeline.forEach((stage, index) => {
      const stageType = Object.keys(stage)[0];

      switch (stageType) {
        case '$group':
          // Ensure group operations can be parallelized across shards
          const groupStage = { ...stage };

          // Add shard key to group _id when possible for better parallelization
          if (typeof groupStage.$group._id === 'object' && shardingConfig.includeShardKeyInGroup) {
            groupStage.$group._id[shardingConfig.shardKey] = `$${shardingConfig.shardKey}`;
          }

          shardOptimizedPipeline.push(groupStage);
          break;

        case '$sort':
          // Optimize sort for sharded collections
          const sortStage = { ...stage };

          // Include shard key in sort to prevent scatter-gather when possible
          if (shardingConfig.includeShardKeyInSort) {
            sortStage.$sort = {
              [shardingConfig.shardKey]: 1,
              ...sortStage.$sort
            };
          }

          shardOptimizedPipeline.push(sortStage);
          break;

        case '$lookup':
          // Optimize lookups for sharded collections
          const lookupStage = { ...stage };

          // Add hint to use shard key when doing lookups
          if (shardingConfig.optimizeLookups) {
            lookupStage.$lookup.pipeline = lookupStage.$lookup.pipeline || [];
            lookupStage.$lookup.pipeline.unshift({
              $match: {
                // Add efficient filters in lookup pipeline
              }
            });
          }

          shardOptimizedPipeline.push(lookupStage);
          break;

        default:
          shardOptimizedPipeline.push(stage);
      }
    });

    return shardOptimizedPipeline;
  }
}

Advanced Aggregation Patterns and Optimizations

Complex Analytics with Window Functions

Implement sophisticated analytics using MongoDB's window functions:

// Advanced analytics patterns with window functions and time-series analysis
class AdvancedAnalyticsEngine {
  constructor(db) {
    this.db = db;
    this.analysisCache = new Map();
  }

  async createTimeSeriesAnalysisPipeline(collection, timeSeriesConfig) {
    // Advanced time-series analysis with window functions
    return [
      // Stage 1: Filter and prepare time series data
      {
        $match: {
          [timeSeriesConfig.timestampField]: {
            $gte: timeSeriesConfig.startDate,
            $lte: timeSeriesConfig.endDate
          },
          ...timeSeriesConfig.filters
        }
      },

      // Stage 2: Add time bucket fields for grouping
      {
        $addFields: {
          timeBucket: {
            $dateTrunc: {
              date: `$${timeSeriesConfig.timestampField}`,
              unit: timeSeriesConfig.timeUnit, // 'hour', 'day', 'week', 'month'
              binSize: timeSeriesConfig.binSize || 1
            }
          },

          // Extract time components for analysis
          hour: { $hour: `$${timeSeriesConfig.timestampField}` },
          dayOfWeek: { $dayOfWeek: `$${timeSeriesConfig.timestampField}` },
          dayOfMonth: { $dayOfMonth: `$${timeSeriesConfig.timestampField}` },
          month: { $month: `$${timeSeriesConfig.timestampField}` },
          year: { $year: `$${timeSeriesConfig.timestampField}` }
        }
      },

      // Stage 3: Group by time bucket and dimensions
      {
        $group: {
          _id: {
            timeBucket: '$timeBucket',
            // Add dimensional grouping
            ...timeSeriesConfig.dimensions.reduce((acc, dim) => {
              acc[dim] = `$${dim}`;
              return acc;
            }, {})
          },

          // Aggregate metrics
          totalValue: { $sum: `$${timeSeriesConfig.valueField}` },
          count: { $sum: 1 },
          averageValue: { $avg: `$${timeSeriesConfig.valueField}` },
          minValue: { $min: `$${timeSeriesConfig.valueField}` },
          maxValue: { $max: `$${timeSeriesConfig.valueField}` },

          // Collect samples for percentile calculations
          values: { $push: `$${timeSeriesConfig.valueField}` },

          // Time pattern analysis
          hourDistribution: {
            $push: {
              hour: '$hour',
              value: `$${timeSeriesConfig.valueField}`
            }
          }
        }
      },

      // Stage 4: Add calculated fields and percentiles
      {
        $addFields: {
          // Calculate percentiles from collected values
          p50: { $percentile: { input: '$values', p: [0.5], method: 'approximate' } },
          p90: { $percentile: { input: '$values', p: [0.9], method: 'approximate' } },
          p95: { $percentile: { input: '$values', p: [0.95], method: 'approximate' } },
          p99: { $percentile: { input: '$values', p: [0.99], method: 'approximate' } },

          // Calculate variance and standard deviation
          variance: { $stdDevPop: '$values' },

          // Calculate value range
          valueRange: { $subtract: ['$maxValue', '$minValue'] },

          // Calculate coefficient of variation
          coefficientOfVariation: {
            $cond: {
              if: { $gt: ['$averageValue', 0] },
              then: { 
                $divide: [
                  { $stdDevPop: '$values' },
                  '$averageValue'
                ]
              },
              else: 0
            }
          }
        }
      },

      // Stage 5: Sort by time for window function processing
      {
        $sort: {
          '_id.timeBucket': 1,
          ...timeSeriesConfig.dimensions.reduce((acc, dim) => {
            acc[`_id.${dim}`] = 1;
            return acc;
          }, {})
        }
      },

      // Stage 6: Apply window functions for trend analysis
      {
        $setWindowFields: {
          partitionBy: timeSeriesConfig.dimensions.reduce((acc, dim) => {
            acc[dim] = `$_id.${dim}`;
            return acc;
          }, {}),
          sortBy: { '_id.timeBucket': 1 },
          output: {
            // Moving averages
            movingAvg7: {
              $avg: '$totalValue',
              window: {
                documents: [-6, 0] // 7-period moving average
              }
            },
            movingAvg30: {
              $avg: '$totalValue',
              window: {
                documents: [-29, 0] // 30-period moving average
              }
            },

            // Growth calculations
            previousPeriodValue: {
              $shift: {
                output: '$totalValue',
                by: -1
              }
            },

            // Cumulative calculations
            cumulativeSum: {
              $sum: '$totalValue',
              window: {
                documents: ['unbounded preceding', 'current']
              }
            },

            // Rank and dense rank
            valueRank: {
              $rank: {},
              window: {
                documents: ['unbounded preceding', 'unbounded following']
              }
            },

            // Min/Max within window
            windowMin: {
              $min: '$totalValue',
              window: {
                documents: [-6, 6] // 13-period window
              }
            },
            windowMax: {
              $max: '$totalValue',
              window: {
                documents: [-6, 6] // 13-period window
              }
            },

            // Calculate period-over-period changes
            periodChange: {
              $subtract: [
                '$totalValue',
                { $shift: { output: '$totalValue', by: -1 } }
              ]
            },

            // Volatility measures
            volatility: {
              $stdDevPop: '$totalValue',
              window: {
                documents: [-29, 0] // 30-period volatility
              }
            }
          }
        }
      },

      // Stage 7: Calculate derived metrics
      {
        $addFields: {
          // Growth rates
          periodGrowthRate: {
            $cond: {
              if: { $gt: ['$previousPeriodValue', 0] },
              then: {
                $multiply: [
                  { $divide: ['$periodChange', '$previousPeriodValue'] },
                  100
                ]
              },
              else: null
            }
          },

          // Trend indicators
          trendDirection: {
            $cond: {
              if: { $gt: ['$totalValue', '$movingAvg7'] },
              then: 'up',
              else: {
                $cond: {
                  if: { $lt: ['$totalValue', '$movingAvg7'] },
                  then: 'down',
                  else: 'stable'
                }
              }
            }
          },

          // Anomaly detection (simple z-score based)
          zScore: {
            $cond: {
              if: { $gt: ['$volatility', 0] },
              then: {
                $divide: [
                  { $subtract: ['$totalValue', '$movingAvg30'] },
                  '$volatility'
                ]
              },
              else: 0
            }
          },

          // Position within window range
          positionInRange: {
            $cond: {
              if: { $gt: [{ $subtract: ['$windowMax', '$windowMin'] }, 0] },
              then: {
                $multiply: [
                  {
                    $divide: [
                      { $subtract: ['$totalValue', '$windowMin'] },
                      { $subtract: ['$windowMax', '$windowMin'] }
                    ]
                  },
                  100
                ]
              },
              else: 50
            }
          }
        }
      },

      // Stage 8: Add anomaly flags
      {
        $addFields: {
          isAnomaly: {
            $or: [
              { $gt: ['$zScore', 2.5] }, // High anomaly
              { $lt: ['$zScore', -2.5] } // Low anomaly
            ]
          },
          anomalyLevel: {
            $cond: {
              if: { $gt: [{ $abs: '$zScore' }, 3] },
              then: 'extreme',
              else: {
                $cond: {
                  if: { $gt: [{ $abs: '$zScore' }, 2] },
                  then: 'high',
                  else: 'normal'
                }
              }
            }
          }
        }
      },

      // Stage 9: Final projection with clean structure
      {
        $project: {
          // Time dimension
          timeBucket: '$_id.timeBucket',

          // Other dimensions
          ...timeSeriesConfig.dimensions.reduce((acc, dim) => {
            acc[dim] = `$_id.${dim}`;
            return acc;
          }, {}),

          // Core metrics
          metrics: {
            totalValue: { $round: ['$totalValue', 2] },
            count: '$count',
            averageValue: { $round: ['$averageValue', 2] },
            minValue: '$minValue',
            maxValue: '$maxValue',
            valueRange: '$valueRange'
          },

          // Statistical measures
          statistics: {
            p50: { $arrayElemAt: ['$p50', 0] },
            p90: { $arrayElemAt: ['$p90', 0] },
            p95: { $arrayElemAt: ['$p95', 0] },
            p99: { $arrayElemAt: ['$p99', 0] },
            variance: { $round: ['$variance', 2] },
            coefficientOfVariation: { $round: ['$coefficientOfVariation', 4] }
          },

          // Trend analysis
          trends: {
            movingAvg7: { $round: ['$movingAvg7', 2] },
            movingAvg30: { $round: ['$movingAvg30', 2] },
            periodChange: { $round: ['$periodChange', 2] },
            periodGrowthRate: { $round: ['$periodGrowthRate', 2] },
            trendDirection: '$trendDirection',
            cumulativeSum: { $round: ['$cumulativeSum', 2] }
          },

          // Anomaly detection
          anomalies: {
            zScore: { $round: ['$zScore', 3] },
            isAnomaly: '$isAnomaly',
            anomalyLevel: '$anomalyLevel',
            positionInRange: { $round: ['$positionInRange', 1] }
          },

          // Rankings
          rankings: {
            valueRank: '$valueRank',
            volatility: { $round: ['$volatility', 2] }
          },

          _id: 0
        }
      },

      // Stage 10: Sort final results
      {
        $sort: {
          timeBucket: 1,
          ...timeSeriesConfig.dimensions.reduce((acc, dim) => {
            acc[dim] = 1;
            return acc;
          }, {})
        }
      }
    ];
  }

  async createCohortAnalysisPipeline(collection, cohortConfig) {
    // Advanced cohort analysis for user behavior tracking
    return [
      // Stage 1: Filter and prepare user event data
      {
        $match: {
          [cohortConfig.eventDateField]: {
            $gte: cohortConfig.startDate,
            $lte: cohortConfig.endDate
          },
          [cohortConfig.eventTypeField]: { $in: cohortConfig.eventTypes }
        }
      },

      // Stage 2: Determine cohort assignment based on first event
      {
        $group: {
          _id: `$${cohortConfig.userIdField}`,
          firstEventDate: { $min: `$${cohortConfig.eventDateField}` },
          allEvents: {
            $push: {
              eventDate: `$${cohortConfig.eventDateField}`,
              eventType: `$${cohortConfig.eventTypeField}`,
              eventValue: `$${cohortConfig.valueField}`
            }
          }
        }
      },

      // Stage 3: Add cohort period (week/month of first event)
      {
        $addFields: {
          cohortPeriod: {
            $dateTrunc: {
              date: '$firstEventDate',
              unit: cohortConfig.cohortTimeUnit, // 'week' or 'month'
              binSize: 1
            }
          }
        }
      },

      // Stage 4: Unwind events for period analysis
      { $unwind: '$allEvents' },

      // Stage 5: Calculate periods since cohort start
      {
        $addFields: {
          periodsSinceCohort: {
            $floor: {
              $divide: [
                { $subtract: ['$allEvents.eventDate', '$firstEventDate'] },
                cohortConfig.cohortTimeUnit === 'week' ? 604800000 : 2629746000 // ms in week/month
              ]
            }
          }
        }
      },

      // Stage 6: Group by cohort and period for retention analysis
      {
        $group: {
          _id: {
            cohortPeriod: '$cohortPeriod',
            periodNumber: '$periodsSinceCohort'
          },

          // Cohort metrics
          activeUsers: { $addToSet: '$_id' }, // Unique users active in this period
          totalEvents: { $sum: 1 },
          totalValue: { $sum: '$allEvents.eventValue' },

          // Event type breakdown
          eventTypeBreakdown: {
            $push: {
              eventType: '$allEvents.eventType',
              value: '$allEvents.eventValue'
            }
          }
        }
      },

      // Stage 7: Calculate active user counts
      {
        $addFields: {
          activeUserCount: { $size: '$activeUsers' }
        }
      },

      // Stage 8: Get cohort size (period 0 users) for retention calculation
      {
        $lookup: {
          from: collection,
          let: { 
            cohortPeriod: '$_id.cohortPeriod'
          },
          pipeline: [
            {
              $match: {
                $expr: {
                  $and: [
                    { $eq: ['$$cohortPeriod', '$_id.cohortPeriod'] },
                    { $eq: ['$_id.periodNumber', 0] }
                  ]
                }
              }
            },
            {
              $project: {
                cohortSize: '$activeUserCount',
                _id: 0
              }
            }
          ],
          as: 'cohortSizeData'
        }
      },

      // Stage 9: Calculate retention rates
      {
        $addFields: {
          cohortSize: { 
            $ifNull: [
              { $arrayElemAt: ['$cohortSizeData.cohortSize', 0] },
              '$activeUserCount' // Use current count if period 0 data not found
            ]
          }
        }
      },

      {
        $addFields: {
          retentionRate: {
            $cond: {
              if: { $gt: ['$cohortSize', 0] },
              then: {
                $round: [
                  { $multiply: [{ $divide: ['$activeUserCount', '$cohortSize'] }, 100] },
                  2
                ]
              },
              else: 0
            }
          }
        }
      },

      // Stage 10: Add cohort analysis metrics
      {
        $addFields: {
          // Average events per user
          eventsPerUser: {
            $cond: {
              if: { $gt: ['$activeUserCount', 0] },
              then: { $round: [{ $divide: ['$totalEvents', '$activeUserCount'] }, 2] },
              else: 0
            }
          },

          // Average value per user
          valuePerUser: {
            $cond: {
              if: { $gt: ['$activeUserCount', 0] },
              then: { $round: [{ $divide: ['$totalValue', '$activeUserCount'] }, 2] },
              else: 0
            }
          },

          // Average value per event
          valuePerEvent: {
            $cond: {
              if: { $gt: ['$totalEvents', 0] },
              then: { $round: [{ $divide: ['$totalValue', '$totalEvents'] }, 2] },
              else: 0
            }
          }
        }
      },

      // Stage 11: Group event types for analysis
      {
        $addFields: {
          eventTypeSummary: {
            $reduce: {
              input: '$eventTypeBreakdown',
              initialValue: {},
              in: {
                $mergeObjects: [
                  '$$value',
                  {
                    $arrayToObject: [{
                      k: '$$this.eventType',
                      v: {
                        $add: [
                          { $ifNull: [{ $getField: { field: '$$this.eventType', input: '$$value' } }, 0] },
                          '$$this.value'
                        ]
                      }
                    }]
                  }
                ]
              }
            }
          }
        }
      },

      // Stage 12: Final projection
      {
        $project: {
          cohortPeriod: '$_id.cohortPeriod',
          periodNumber: '$_id.periodNumber',
          cohortSize: '$cohortSize',
          activeUsers: '$activeUserCount',
          retentionRate: '$retentionRate',

          engagement: {
            totalEvents: '$totalEvents',
            eventsPerUser: '$eventsPerUser',
            totalValue: { $round: ['$totalValue', 2] },
            valuePerUser: '$valuePerUser',
            valuePerEvent: '$valuePerEvent'
          },

          eventBreakdown: '$eventTypeSummary',

          // Cohort health indicators
          healthIndicators: {
            isHealthyCohort: { $gte: ['$retentionRate', cohortConfig.healthyRetentionThreshold || 20] },
            engagementLevel: {
              $cond: {
                if: { $gte: ['$eventsPerUser', cohortConfig.highEngagementThreshold || 5] },
                then: 'high',
                else: {
                  $cond: {
                    if: { $gte: ['$eventsPerUser', cohortConfig.mediumEngagementThreshold || 2] },
                    then: 'medium',
                    else: 'low'
                  }
                }
              }
            }
          },

          _id: 0
        }
      },

      // Stage 13: Sort results
      {
        $sort: {
          cohortPeriod: 1,
          periodNumber: 1
        }
      }
    ];
  }

  async createAdvancedRFMAnalysis(collection, rfmConfig) {
    // RFM (Recency, Frequency, Monetary) analysis for customer segmentation
    return [
      // Stage 1: Filter customer transactions
      {
        $match: {
          [rfmConfig.transactionDateField]: {
            $gte: rfmConfig.analysisStartDate,
            $lte: rfmConfig.analysisEndDate
          },
          [rfmConfig.amountField]: { $gt: 0 }
        }
      },

      // Stage 2: Calculate RFM metrics per customer
      {
        $group: {
          _id: `$${rfmConfig.customerIdField}`,

          // Recency: Days since last transaction
          lastTransactionDate: { $max: `$${rfmConfig.transactionDateField}` },

          // Frequency: Number of transactions
          transactionCount: { $sum: 1 },

          // Monetary: Total transaction value
          totalSpent: { $sum: `$${rfmConfig.amountField}` },

          // Additional metrics
          averageTransactionValue: { $avg: `$${rfmConfig.amountField}` },
          firstTransactionDate: { $min: `$${rfmConfig.transactionDateField}` },

          // Transaction patterns
          transactions: {
            $push: {
              date: `$${rfmConfig.transactionDateField}`,
              amount: `$${rfmConfig.amountField}`
            }
          }
        }
      },

      // Stage 3: Calculate recency in days
      {
        $addFields: {
          recencyDays: {
            $floor: {
              $divide: [
                { $subtract: [rfmConfig.currentDate, '$lastTransactionDate'] },
                86400000 // milliseconds in a day
              ]
            }
          },

          customerLifetimeDays: {
            $floor: {
              $divide: [
                { $subtract: ['$lastTransactionDate', '$firstTransactionDate'] },
                86400000
              ]
            }
          }
        }
      },

      // Stage 4: Calculate percentiles for scoring using window functions
      {
        $setWindowFields: {
          sortBy: { recencyDays: 1 },
          output: {
            recencyPercentile: {
              $percentRank: {},
              window: {
                documents: ['unbounded preceding', 'unbounded following']
              }
            }
          }
        }
      },

      {
        $setWindowFields: {
          sortBy: { transactionCount: 1 },
          output: {
            frequencyPercentile: {
              $percentRank: {},
              window: {
                documents: ['unbounded preceding', 'unbounded following']
              }
            }
          }
        }
      },

      {
        $setWindowFields: {
          sortBy: { totalSpent: 1 },
          output: {
            monetaryPercentile: {
              $percentRank: {},
              window: {
                documents: ['unbounded preceding', 'unbounded following']
              }
            }
          }
        }
      },

      // Stage 5: Calculate RFM scores (1-5 scale)
      {
        $addFields: {
          recencyScore: {
            $cond: {
              if: { $lte: ['$recencyPercentile', 0.2] },
              then: 5, // Most recent customers get highest score
              else: {
                $cond: {
                  if: { $lte: ['$recencyPercentile', 0.4] },
                  then: 4,
                  else: {
                    $cond: {
                      if: { $lte: ['$recencyPercentile', 0.6] },
                      then: 3,
                      else: {
                        $cond: {
                          if: { $lte: ['$recencyPercentile', 0.8] },
                          then: 2,
                          else: 1
                        }
                      }
                    }
                  }
                }
              }
            }
          },

          frequencyScore: {
            $cond: {
              if: { $gte: ['$frequencyPercentile', 0.8] },
              then: 5,
              else: {
                $cond: {
                  if: { $gte: ['$frequencyPercentile', 0.6] },
                  then: 4,
                  else: {
                    $cond: {
                      if: { $gte: ['$frequencyPercentile', 0.4] },
                      then: 3,
                      else: {
                        $cond: {
                          if: { $gte: ['$frequencyPercentile', 0.2] },
                          then: 2,
                          else: 1
                        }
                      }
                    }
                  }
                }
              }
            }
          },

          monetaryScore: {
            $cond: {
              if: { $gte: ['$monetaryPercentile', 0.8] },
              then: 5,
              else: {
                $cond: {
                  if: { $gte: ['$monetaryPercentile', 0.6] },
                  then: 4,
                  else: {
                    $cond: {
                      if: { $gte: ['$monetaryPercentile', 0.4] },
                      then: 3,
                      else: {
                        $cond: {
                          if: { $gte: ['$monetaryPercentile', 0.2] },
                          then: 2,
                          else: 1
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      },

      // Stage 6: Create combined RFM score and segment
      {
        $addFields: {
          rfmScore: {
            $concat: [
              { $toString: '$recencyScore' },
              { $toString: '$frequencyScore' },
              { $toString: '$monetaryScore' }
            ]
          },

          // Calculate overall customer value score
          customerValueScore: {
            $round: [
              {
                $add: [
                  { $multiply: ['$recencyScore', rfmConfig.recencyWeight || 0.3] },
                  { $multiply: ['$frequencyScore', rfmConfig.frequencyWeight || 0.3] },
                  { $multiply: ['$monetaryScore', rfmConfig.monetaryWeight || 0.4] }
                ]
              },
              2
            ]
          }
        }
      },

      // Stage 7: Assign customer segments
      {
        $addFields: {
          customerSegment: {
            $switch: {
              branches: [
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 4] },
                      { $gte: ['$frequencyScore', 4] },
                      { $gte: ['$monetaryScore', 4] }
                    ]
                  },
                  then: 'Champions'
                },
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 3] },
                      { $gte: ['$frequencyScore', 3] },
                      { $gte: ['$monetaryScore', 4] }
                    ]
                  },
                  then: 'Loyal Customers'
                },
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 4] },
                      { $lte: ['$frequencyScore', 2] },
                      { $gte: ['$monetaryScore', 3] }
                    ]
                  },
                  then: 'Potential Loyalists'
                },
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 4] },
                      { $lte: ['$frequencyScore', 1] },
                      { $lte: ['$monetaryScore', 1] }
                    ]
                  },
                  then: 'New Customers'
                },
                {
                  case: { 
                    $and: [
                      { $gte: ['$recencyScore', 3] },
                      { $lte: ['$frequencyScore', 3] },
                      { $gte: ['$monetaryScore', 3] }
                    ]
                  },
                  then: 'Promising'
                },
                {
                  case: { 
                    $and: [
                      { $lte: ['$recencyScore', 2] },
                      { $gte: ['$frequencyScore', 3] },
                      { $gte: ['$monetaryScore', 3] }
                    ]
                  },
                  then: 'Need Attention'
                },
                {
                  case: { 
                    $and: [
                      { $lte: ['$recencyScore', 2] },
                      { $lte: ['$frequencyScore', 2] },
                      { $gte: ['$monetaryScore', 3] }
                    ]
                  },
                  then: 'About to Sleep'
                },
                {
                  case: { 
                    $and: [
                      { $lte: ['$recencyScore', 2] },
                      { $gte: ['$frequencyScore', 4] },
                      { $lte: ['$monetaryScore', 2] }
                    ]
                  },
                  then: 'At Risk'
                },
                {
                  case: { 
                    $and: [
                      { $lte: ['$recencyScore', 1] },
                      { $gte: ['$frequencyScore', 4] },
                      { $gte: ['$monetaryScore', 4] }
                    ]
                  },
                  then: 'Cannot Lose Them'
                },
                {
                  case: { 
                    $and: [
                      { $eq: ['$recencyScore', 3] },
                      { $eq: ['$frequencyScore', 1] },
                      { $eq: ['$monetaryScore', 1] }
                    ]
                  },
                  then: 'Hibernating'
                }
              ],
              default: 'Lost Customers'
            }
          }
        }
      },

      // Stage 8: Add customer insights and recommendations
      {
        $addFields: {
          insights: {
            daysSinceLastPurchase: '$recencyDays',
            lifetimeValue: { $round: ['$totalSpent', 2] },
            averageOrderValue: { $round: ['$averageTransactionValue', 2] },
            purchaseFrequency: {
              $cond: {
                if: { $gt: ['$customerLifetimeDays', 0] },
                then: { 
                  $round: [
                    { $divide: ['$transactionCount', { $divide: ['$customerLifetimeDays', 30] }] },
                    2
                  ]
                },
                else: 0
              }
            },

            // Customer lifecycle stage
            lifecycleStage: {
              $cond: {
                if: { $lte: ['$customerLifetimeDays', 30] },
                then: 'New',
                else: {
                  $cond: {
                    if: { $lte: ['$customerLifetimeDays', 180] },
                    then: 'Developing',
                    else: {
                      $cond: {
                        if: { $lte: ['$recencyDays', 90] },
                        then: 'Established',
                        else: 'Declining'
                      }
                    }
                  }
                }
              }
            }
          },

          // Marketing recommendations
          recommendations: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$customerSegment', 'Champions'] },
                  then: ['Reward loyalty', 'VIP treatment', 'Brand advocacy program']
                },
                {
                  case: { $eq: ['$customerSegment', 'New Customers'] },
                  then: ['Onboarding campaign', 'Product education', 'Early engagement']
                },
                {
                  case: { $eq: ['$customerSegment', 'At Risk'] },
                  then: ['Win-back campaign', 'Special offers', 'Survey for feedback']
                },
                {
                  case: { $eq: ['$customerSegment', 'Lost Customers'] },
                  then: ['Aggressive win-back offers', 'Product updates', 'Reactivation campaign']
                }
              ],
              default: ['Standard marketing', 'Regular engagement']
            }
          }
        }
      },

      // Stage 9: Final projection
      {
        $project: {
          customerId: '$_id',
          rfmScores: {
            recency: '$recencyScore',
            frequency: '$frequencyScore',
            monetary: '$monetaryScore',
            combined: '$rfmScore',
            customerValue: '$customerValueScore'
          },
          segment: '$customerSegment',
          insights: '$insights',
          recommendations: '$recommendations',
          rawMetrics: {
            recencyDays: '$recencyDays',
            transactionCount: '$transactionCount',
            totalSpent: { $round: ['$totalSpent', 2] },
            averageTransactionValue: { $round: ['$averageTransactionValue', 2] },
            customerLifetimeDays: '$customerLifetimeDays'
          },
          _id: 0
        }
      },

      // Stage 10: Sort by customer value score
      {
        $sort: {
          'rfmScores.customerValue': -1,
          'rawMetrics.totalSpent': -1
        }
      }
    ];
  }
}

SQL-Style Aggregation Optimization with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB aggregation optimization:

-- QueryLeaf aggregation optimization with SQL-style syntax

-- Optimized complex analytics query with early filtering
WITH filtered_data AS (
  SELECT *
  FROM orders 
  WHERE order_date >= '2024-01-01'
    AND order_date <= '2024-12-31'
    AND status IN ('completed', 'shipped')
  -- QueryLeaf optimizes this to use compound index on (order_date, status)
),

enriched_data AS (
  SELECT 
    o.*,
    c.region_id,
    c.customer_segment,
    r.region_name,
    oi.product_id,
    oi.quantity,
    oi.unit_price,
    p.category,
    p.subcategory,
    p.cost_basis,

    -- Calculate metrics early in pipeline
    (oi.quantity * oi.unit_price) as item_revenue,
    (oi.quantity * p.cost_basis) as item_cost

  FROM filtered_data o
  -- QueryLeaf optimizes joins with $lookup sub-pipelines
  JOIN customers c ON o.customer_id = c.customer_id
  JOIN regions r ON c.region_id = r.region_id
  CROSS JOIN UNNEST(o.items) AS oi
  JOIN products p ON oi.product_id = p.product_id
),

monthly_aggregates AS (
  SELECT 
    DATE_TRUNC('month', order_date) as month,
    region_name,
    category,
    subcategory,
    customer_segment,

    -- Standard aggregations
    COUNT(*) as order_count,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(item_revenue) as total_revenue,
    SUM(item_cost) as total_cost,
    (SUM(item_revenue) - SUM(item_cost)) as profit,
    AVG(item_revenue) as avg_item_revenue,

    -- Statistical measures  
    STDDEV_POP(item_revenue) as revenue_stddev,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY item_revenue) as median_revenue,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY item_revenue) as p95_revenue,

    -- Collect sample for detailed analysis
    ARRAY_AGG(item_revenue ORDER BY item_revenue DESC LIMIT 100) as top_revenues

  FROM enriched_data
  GROUP BY 
    DATE_TRUNC('month', order_date),
    region_name,
    category, 
    subcategory,
    customer_segment
  -- QueryLeaf creates efficient $group stage with proper field projections
)

-- Advanced window functions for trend analysis
SELECT 
  month,
  region_name,
  category,
  subcategory,
  customer_segment,

  -- Core metrics
  order_count,
  unique_customers,
  total_revenue,
  profit,
  ROUND(profit / total_revenue * 100, 2) as profit_margin_pct,
  ROUND(total_revenue / unique_customers, 2) as revenue_per_customer,

  -- Trend analysis using window functions
  LAG(total_revenue, 1) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month
  ) as previous_month_revenue,

  -- Growth calculations
  ROUND(
    ((total_revenue - LAG(total_revenue, 1) OVER (
      PARTITION BY region_name, category, customer_segment 
      ORDER BY month
    )) / LAG(total_revenue, 1) OVER (
      PARTITION BY region_name, category, customer_segment 
      ORDER BY month
    )) * 100, 2
  ) as month_over_month_growth,

  -- Moving averages
  AVG(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) as moving_avg_3month,

  AVG(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
  ) as moving_avg_6month,

  -- Cumulative totals
  SUM(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS UNBOUNDED PRECEDING
  ) as cumulative_revenue,

  -- Rankings and percentiles
  RANK() OVER (
    PARTITION BY month 
    ORDER BY total_revenue DESC
  ) as revenue_rank,

  PERCENT_RANK() OVER (
    PARTITION BY month 
    ORDER BY total_revenue
  ) as revenue_percentile,

  -- Volatility measures
  STDDEV(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
  ) as revenue_volatility,

  -- Min/Max within window
  MIN(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING
  ) as window_min,

  MAX(total_revenue) OVER (
    PARTITION BY region_name, category, customer_segment 
    ORDER BY month 
    ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING  
  ) as window_max,

  -- Position within range
  CASE
    WHEN MAX(total_revenue) OVER (...) - MIN(total_revenue) OVER (...) > 0
    THEN ROUND(
      ((total_revenue - MIN(total_revenue) OVER (...)) / 
       (MAX(total_revenue) OVER (...) - MIN(total_revenue) OVER (...)
      )) * 100, 1
    )
    ELSE 50.0
  END as position_in_range_pct

FROM monthly_aggregates
WHERE month >= '2024-06-01' -- Filter for recent months
ORDER BY month, region_name, category, total_revenue DESC

-- QueryLeaf optimization features:
-- ALLOW_DISK_USE for large aggregations
-- MAX_TIME_MS for timeout control  
-- HINT for index suggestions
-- READ_CONCERN for consistency control
WITH AGGREGATION_OPTIONS (
  ALLOW_DISK_USE = true,
  MAX_TIME_MS = 300000,
  HINT = 'order_date_status_idx',
  READ_CONCERN = 'majority'
);

-- Performance monitoring and optimization
SELECT 
  stage_name,
  execution_time_ms,
  documents_examined,
  documents_returned,
  index_used,
  memory_usage_mb,

  -- Efficiency metrics
  ROUND(documents_returned::FLOAT / documents_examined, 4) as selectivity,
  ROUND(documents_returned / (execution_time_ms / 1000.0), 0) as docs_per_second,

  -- Performance flags
  CASE 
    WHEN execution_time_ms > 30000 THEN 'SLOW_STAGE'
    WHEN documents_examined > documents_returned * 100 THEN 'INEFFICIENT_FILTERING' 
    WHEN NOT index_used AND documents_examined > 10000 THEN 'MISSING_INDEX'
    ELSE 'OPTIMAL'
  END as performance_flag,

  -- Optimization recommendations
  CASE
    WHEN NOT index_used AND documents_examined > 10000 
      THEN 'Add index for this stage'
    WHEN documents_examined > documents_returned * 100 
      THEN 'Move filtering earlier in pipeline'
    WHEN memory_usage_mb > 100 
      THEN 'Consider using allowDiskUse'
    ELSE 'No optimization needed'
  END as recommendation

FROM EXPLAIN_AGGREGATION_PIPELINE('orders', @pipeline_query)
ORDER BY execution_time_ms DESC;

-- Index recommendations based on aggregation patterns
WITH pipeline_analysis AS (
  SELECT 
    collection_name,
    stage_type,
    stage_index,
    field_name,
    operation_type,
    estimated_improvement
  FROM ANALYZE_AGGREGATION_INDEXES(@common_pipelines)
),

index_recommendations AS (
  SELECT 
    collection_name,
    STRING_AGG(field_name, ', ' ORDER BY stage_index) as compound_index_fields,
    COUNT(*) as stages_optimized,
    MAX(estimated_improvement) as max_improvement,
    STRING_AGG(DISTINCT operation_type, ', ') as optimization_types
  FROM pipeline_analysis
  GROUP BY collection_name
)

SELECT 
  collection_name,
  'CREATE INDEX idx_' || REPLACE(compound_index_fields, ', ', '_') || 
  ' ON ' || collection_name || ' (' || compound_index_fields || ')' as create_index_statement,
  stages_optimized,
  max_improvement as estimated_improvement,
  optimization_types,

  -- Priority scoring
  CASE 
    WHEN max_improvement = 'high' AND stages_optimized >= 3 THEN 1
    WHEN max_improvement = 'high' AND stages_optimized >= 2 THEN 2
    WHEN max_improvement = 'medium' AND stages_optimized >= 3 THEN 3
    ELSE 4
  END as priority_rank

FROM index_recommendations
ORDER BY priority_rank, stages_optimized DESC;

-- Memory usage optimization strategies
SELECT 
  pipeline_name,
  total_memory_mb,
  peak_memory_mb,
  documents_processed,

  -- Memory efficiency metrics
  ROUND(peak_memory_mb / (documents_processed / 1000.0), 2) as mb_per_1k_docs,

  -- Memory optimization recommendations
  CASE
    WHEN peak_memory_mb > 500 THEN 'Use allowDiskUse: true'
    WHEN mb_per_1k_docs > 10 THEN 'Reduce projection fields early'
    WHEN documents_processed > 1000000 THEN 'Consider batch processing'
    ELSE 'Memory usage optimal'
  END as memory_recommendation,

  -- Suggested batch size for large datasets
  CASE
    WHEN peak_memory_mb > 1000 THEN 10000
    WHEN peak_memory_mb > 500 THEN 25000  
    WHEN peak_memory_mb > 100 THEN 50000
    ELSE NULL
  END as suggested_batch_size

FROM PIPELINE_PERFORMANCE_METRICS()
WHERE total_memory_mb > 50 -- Focus on memory-intensive pipelines
ORDER BY peak_memory_mb DESC;

-- QueryLeaf aggregation optimization provides:
-- 1. Automatic pipeline stage reordering for optimal performance
-- 2. Index usage hints and recommendations
-- 3. Memory management with disk spilling controls
-- 4. Window function optimization with efficient partitioning
-- 5. Early filtering and projection optimization
-- 6. Compound index recommendations based on pipeline analysis
-- 7. Performance monitoring and bottleneck identification
-- 8. Batch processing strategies for large datasets
-- 9. SQL-familiar syntax for complex analytical operations
-- 10. Integration with MongoDB's native aggregation performance features

Best Practices for Aggregation Pipeline Optimization

Performance Design Guidelines

Essential practices for high-performance aggregation pipelines:

Early Filtering: Move $match stages as early as possible to reduce data volume
Index Utilization: Design compound indexes specifically for aggregation patterns
Memory Management: Use allowDiskUse: true for large datasets
Stage Ordering: Optimize stage sequence to minimize document flow
Projection Optimization: Project only necessary fields at each stage
Lookup Efficiency: Use sub-pipelines in $lookup to reduce data transfer

Monitoring and Optimization

Implement comprehensive performance monitoring:

Execution Analysis: Use explain() to identify bottlenecks and inefficiencies
Memory Tracking: Monitor memory usage patterns and disk spilling
Index Usage: Verify optimal index utilization across pipeline stages
Performance Metrics: Track execution times and document processing rates
Resource Utilization: Monitor CPU, memory, and I/O during aggregations
Benchmark Comparison: Establish performance baselines and track improvements

Conclusion

MongoDB aggregation pipeline optimization requires strategic approach to stage ordering, memory management, and index design. Unlike traditional SQL query optimization that relies on automated query planners, MongoDB aggregation optimization demands understanding of pipeline execution, data flow patterns, and resource utilization characteristics.

Key optimization benefits include:

Predictable Performance: Optimized pipelines deliver consistent execution times regardless of data growth
Efficient Resource Usage: Strategic memory management and disk spilling prevent resource exhaustion
Scalable Analytics: Proper optimization enables complex analytics on large datasets
Index Integration: Strategic indexing dramatically improves pipeline performance
Flexible Processing: Support for complex analytical operations with optimal resource usage

Whether you're building real-time analytics platforms, business intelligence systems, or complex data transformation pipelines, MongoDB aggregation optimization with QueryLeaf's familiar SQL interface provides the foundation for high-performance analytical processing. This combination enables you to implement sophisticated analytics solutions while preserving familiar query patterns and optimization approaches.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB aggregation pipeline execution through intelligent stage reordering, index recommendations, and memory management while providing SQL-familiar syntax for complex analytical operations. Advanced window functions, statistical calculations, and performance monitoring are seamlessly handled through familiar SQL patterns, making high-performance analytics both powerful and accessible.

The integration of sophisticated aggregation optimization with SQL-style analytics makes MongoDB an ideal platform for applications requiring both complex analytical processing and familiar database interaction patterns, ensuring your analytics solutions remain both performant and maintainable as they scale and evolve.

September 11, 2025
20 min read

MongoDB GridFS and File Storage: SQL-Style Binary Data Management with Metadata Integration

Modern applications increasingly handle diverse file types - documents, images, videos, audio files, backups, and large datasets. Traditional relational databases struggle with binary data storage, often requiring external file systems, complex blob handling, or separate storage services that create synchronization challenges and architectural complexity.

MongoDB GridFS provides native large file storage capabilities directly within your database, enabling seamless binary data management with integrated metadata, automatic chunking for large files, and powerful querying capabilities. Unlike external file storage solutions, GridFS maintains transactional consistency, provides built-in replication, and integrates file operations with your existing database queries.

The File Storage Challenge

Traditional approaches to file storage have significant limitations:

-- Traditional SQL file storage approaches - complex and fragmented

-- Option 1: Store file paths only (external file system)
CREATE TABLE documents (
    document_id SERIAL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    file_path VARCHAR(500) NOT NULL,    -- Path to external file
    file_size BIGINT,
    mime_type VARCHAR(100),
    upload_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    uploaded_by INTEGER REFERENCES users(user_id)
);

-- Insert file reference
INSERT INTO documents (title, file_path, file_size, mime_type, uploaded_by)
VALUES ('Annual Report 2024', '/files/2024/annual-report.pdf', 2048576, 'application/pdf', 123);

-- Problems with external file storage:
-- - File system and database can become out of sync
-- - No transactional consistency between file and metadata
-- - Complex backup and replication strategies
-- - Permission and security management split between systems
-- - No atomic operations across file and metadata
-- - Difficult to query file content and metadata together

-- Option 2: Store files as BLOBs (limited and inefficient)
CREATE TABLE file_storage (
    file_id SERIAL PRIMARY KEY,
    filename VARCHAR(255),
    file_data BYTEA,           -- Binary data (PostgreSQL)
    -- file_data LONGBLOB,     -- Binary data (MySQL)
    file_size INTEGER,
    mime_type VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert binary data
INSERT INTO file_storage (filename, file_data, file_size, mime_type)
VALUES ('document.pdf', pg_read_binary_file('/tmp/document.pdf'), 1048576, 'application/pdf');

-- Problems with BLOB storage:
-- - Size limitations (often 16MB-4GB depending on database)
-- - Memory issues when loading large files
-- - Poor performance for streaming large files
-- - Limited metadata and search capabilities
-- - Difficult to handle partial file operations
-- - Database backup sizes become unmanageable
-- - No built-in file chunking or streaming support

MongoDB GridFS solves these challenges comprehensively:

// MongoDB GridFS - native large file storage with integrated metadata
const { GridFSBucket, MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('document_management');

// Create GridFS bucket for file operations
const bucket = new GridFSBucket(db, { 
  bucketName: 'documents',
  chunkSizeBytes: 1024 * 1024  // 1MB chunks for optimal performance
});

// Store file with rich metadata - no size limits
const uploadStream = bucket.openUploadStream('annual-report-2024.pdf', {
  metadata: {
    title: 'Annual Report 2024',
    description: 'Company annual financial report for 2024',
    category: 'financial-reports',
    department: 'finance',
    confidentialityLevel: 'internal',
    uploadedBy: ObjectId('64f1a2c4567890abcdef1234'),
    uploadedByName: 'John Smith',
    tags: ['annual', 'report', '2024', 'finance', 'quarterly'],
    approvalStatus: 'pending',
    version: '1.0',
    relatedDocuments: [
      ObjectId('64f1a2c4567890abcdef5678'),
      ObjectId('64f1a2c4567890abcdef9abc')
    ],
    accessPermissions: {
      read: ['finance', 'management', 'audit'],
      write: ['finance'],
      admin: ['finance-manager']
    }
  }
});

// Stream file data efficiently (handles files of any size)
const fs = require('fs');
fs.createReadStream('./annual-report-2024.pdf')
  .pipe(uploadStream)
  .on('error', (error) => {
    console.error('File upload failed:', error);
  })
  .on('finish', () => {
    console.log('File uploaded successfully:', uploadStream.id);
  });

// Benefits of GridFS:
// - No file size limitations (handles multi-GB files efficiently)
// - Automatic chunking and streaming for optimal memory usage
// - Rich metadata storage with full query capabilities
// - Transactional consistency between file data and metadata
// - Built-in replication and backup with your database
// - Powerful file search and filtering capabilities
// - Atomic file operations with metadata updates
// - Integration with MongoDB aggregation pipeline

Understanding MongoDB GridFS

GridFS Architecture and File Operations

Implement comprehensive file management systems:

// Advanced GridFS file management system
class GridFSFileManager {
  constructor(db, bucketName = 'files') {
    this.db = db;
    this.bucketName = bucketName;
    this.bucket = new GridFSBucket(db, {
      bucketName: bucketName,
      chunkSizeBytes: 1024 * 1024 // 1MB chunks
    });

    // Collections automatically created by GridFS
    this.filesCollection = db.collection(`${bucketName}.files`);
    this.chunksCollection = db.collection(`${bucketName}.chunks`);
  }

  async uploadFile(filePath, filename, metadata = {}) {
    // Upload file with comprehensive metadata
    return new Promise((resolve, reject) => {
      const fs = require('fs');
      const uploadStream = this.bucket.openUploadStream(filename, {
        metadata: {
          ...metadata,
          uploadDate: new Date(),
          originalPath: filePath,
          fileStats: fs.statSync(filePath),
          checksum: this.calculateChecksum(filePath),
          contentAnalysis: this.analyzeFileContent(filePath, metadata.mimeType)
        }
      });

      fs.createReadStream(filePath)
        .pipe(uploadStream)
        .on('error', reject)
        .on('finish', () => {
          resolve({
            fileId: uploadStream.id,
            filename: filename,
            uploadDate: new Date(),
            metadata: metadata
          });
        });
    });
  }

  async uploadFromBuffer(buffer, filename, metadata = {}) {
    // Upload file from memory buffer
    return new Promise((resolve, reject) => {
      const uploadStream = this.bucket.openUploadStream(filename, {
        metadata: {
          ...metadata,
          uploadDate: new Date(),
          bufferSize: buffer.length,
          source: 'buffer'
        }
      });

      const { Readable } = require('stream');
      const bufferStream = new Readable();
      bufferStream.push(buffer);
      bufferStream.push(null);

      bufferStream
        .pipe(uploadStream)
        .on('error', reject)
        .on('finish', () => {
          resolve({
            fileId: uploadStream.id,
            filename: filename,
            size: buffer.length
          });
        });
    });
  }

  async downloadFile(fileId, outputPath) {
    // Download file to local filesystem
    return new Promise((resolve, reject) => {
      const fs = require('fs');
      const downloadStream = this.bucket.openDownloadStream(ObjectId(fileId));
      const writeStream = fs.createWriteStream(outputPath);

      downloadStream
        .pipe(writeStream)
        .on('error', reject)
        .on('finish', () => {
          resolve({
            fileId: fileId,
            downloadPath: outputPath,
            downloadDate: new Date()
          });
        });

      downloadStream.on('error', reject);
    });
  }

  async getFileBuffer(fileId) {
    // Get file as buffer for in-memory processing
    return new Promise((resolve, reject) => {
      const downloadStream = this.bucket.openDownloadStream(ObjectId(fileId));
      const chunks = [];

      downloadStream.on('data', (chunk) => {
        chunks.push(chunk);
      });

      downloadStream.on('error', reject);

      downloadStream.on('end', () => {
        const buffer = Buffer.concat(chunks);
        resolve(buffer);
      });
    });
  }

  async streamFileToResponse(fileId, response) {
    // Stream file directly to HTTP response (efficient for web serving)
    const file = await this.getFileMetadata(fileId);

    if (!file) {
      throw new Error(`File ${fileId} not found`);
    }

    // Set appropriate headers
    response.set({
      'Content-Type': file.metadata?.mimeType || 'application/octet-stream',
      'Content-Length': file.length,
      'Content-Disposition': `inline; filename="${file.filename}"`,
      'Cache-Control': 'public, max-age=3600',
      'ETag': `"${file.md5}"`
    });

    const downloadStream = this.bucket.openDownloadStream(ObjectId(fileId));

    return new Promise((resolve, reject) => {
      downloadStream
        .pipe(response)
        .on('error', reject)
        .on('finish', resolve);

      downloadStream.on('error', reject);
    });
  }

  async getFileMetadata(fileId) {
    // Get comprehensive file metadata
    const file = await this.filesCollection.findOne({ 
      _id: ObjectId(fileId) 
    });

    if (!file) {
      return null;
    }

    return {
      fileId: file._id,
      filename: file.filename,
      length: file.length,
      chunkSize: file.chunkSize,
      uploadDate: file.uploadDate,
      md5: file.md5,
      metadata: file.metadata || {},

      // Additional computed properties
      humanSize: this.formatFileSize(file.length),
      mimeType: file.metadata?.mimeType,
      category: file.metadata?.category,
      tags: file.metadata?.tags || [],

      // File analysis
      chunkCount: Math.ceil(file.length / file.chunkSize),
      isComplete: await this.verifyFileIntegrity(fileId)
    };
  }

  async searchFiles(searchCriteria) {
    // Advanced file search with metadata querying
    const query = {};

    // Filename search
    if (searchCriteria.filename) {
      query.filename = new RegExp(searchCriteria.filename, 'i');
    }

    // Metadata searches
    if (searchCriteria.category) {
      query['metadata.category'] = searchCriteria.category;
    }

    if (searchCriteria.tags) {
      query['metadata.tags'] = { $in: searchCriteria.tags };
    }

    if (searchCriteria.mimeType) {
      query['metadata.mimeType'] = searchCriteria.mimeType;
    }

    if (searchCriteria.uploadedBy) {
      query['metadata.uploadedBy'] = ObjectId(searchCriteria.uploadedBy);
    }

    // Date range search
    if (searchCriteria.dateRange) {
      query.uploadDate = {
        $gte: new Date(searchCriteria.dateRange.start),
        $lte: new Date(searchCriteria.dateRange.end)
      };
    }

    // Size range search
    if (searchCriteria.sizeRange) {
      query.length = {
        $gte: searchCriteria.sizeRange.min || 0,
        $lte: searchCriteria.sizeRange.max || Number.MAX_SAFE_INTEGER
      };
    }

    const files = await this.filesCollection
      .find(query)
      .sort({ uploadDate: -1 })
      .limit(searchCriteria.limit || 50)
      .toArray();

    return files.map(file => ({
      fileId: file._id,
      filename: file.filename,
      size: file.length,
      humanSize: this.formatFileSize(file.length),
      uploadDate: file.uploadDate,
      metadata: file.metadata || {},
      md5: file.md5
    }));
  }

  async updateFileMetadata(fileId, metadataUpdate) {
    // Update file metadata without modifying file content
    const result = await this.filesCollection.updateOne(
      { _id: ObjectId(fileId) },
      { 
        $set: {
          'metadata.lastModified': new Date(),
          ...Object.keys(metadataUpdate).reduce((acc, key) => {
            acc[`metadata.${key}`] = metadataUpdate[key];
            return acc;
          }, {})
        }
      }
    );

    if (result.modifiedCount === 0) {
      throw new Error(`File ${fileId} not found or metadata unchanged`);
    }

    return await this.getFileMetadata(fileId);
  }

  async deleteFile(fileId) {
    // Delete file and all its chunks
    try {
      await this.bucket.delete(ObjectId(fileId));

      // Log deletion for audit
      await this.db.collection('file_audit_log').insertOne({
        operation: 'delete',
        fileId: ObjectId(fileId),
        deletedAt: new Date(),
        deletedBy: 'system' // Could be passed as parameter
      });

      return { 
        success: true, 
        fileId: fileId,
        deletedAt: new Date()
      };
    } catch (error) {
      throw new Error(`Failed to delete file ${fileId}: ${error.message}`);
    }
  }

  async duplicateFile(fileId, newFilename, metadataChanges = {}) {
    // Create a duplicate of an existing file
    const originalFile = await this.getFileMetadata(fileId);
    if (!originalFile) {
      throw new Error(`Original file ${fileId} not found`);
    }

    const buffer = await this.getFileBuffer(fileId);

    const newMetadata = {
      ...originalFile.metadata,
      ...metadataChanges,
      originalFileId: ObjectId(fileId),
      duplicatedAt: new Date(),
      duplicatedFrom: originalFile.filename
    };

    return await this.uploadFromBuffer(buffer, newFilename, newMetadata);
  }

  async getFilesByCategory(category, options = {}) {
    // Get files by category with optional sorting and pagination
    const query = { 'metadata.category': category };

    let cursor = this.filesCollection.find(query);

    if (options.sortBy) {
      const sortField = options.sortBy === 'size' ? 'length' : 
                       options.sortBy === 'date' ? 'uploadDate' : 
                       options.sortBy;
      cursor = cursor.sort({ [sortField]: options.sortOrder === 'asc' ? 1 : -1 });
    }

    if (options.skip) cursor = cursor.skip(options.skip);
    if (options.limit) cursor = cursor.limit(options.limit);

    const files = await cursor.toArray();

    return {
      category: category,
      files: files.map(file => ({
        fileId: file._id,
        filename: file.filename,
        size: file.length,
        humanSize: this.formatFileSize(file.length),
        uploadDate: file.uploadDate,
        metadata: file.metadata
      })),
      totalCount: await this.filesCollection.countDocuments(query)
    };
  }

  async getStorageStatistics() {
    // Get comprehensive storage statistics
    const stats = await this.filesCollection.aggregate([
      {
        $group: {
          _id: null,
          totalFiles: { $sum: 1 },
          totalSize: { $sum: '$length' },
          avgFileSize: { $avg: '$length' },
          oldestFile: { $min: '$uploadDate' },
          newestFile: { $max: '$uploadDate' }
        }
      }
    ]).toArray();

    const categoryStats = await this.filesCollection.aggregate([
      {
        $group: {
          _id: '$metadata.category',
          count: { $sum: 1 },
          totalSize: { $sum: '$length' },
          avgSize: { $avg: '$length' }
        }
      },
      { $sort: { totalSize: -1 } }
    ]).toArray();

    const mimeTypeStats = await this.filesCollection.aggregate([
      {
        $group: {
          _id: '$metadata.mimeType',
          count: { $sum: 1 },
          totalSize: { $sum: '$length' }
        }
      },
      { $sort: { count: -1 } }
    ]).toArray();

    const chunkStats = await this.chunksCollection.aggregate([
      {
        $group: {
          _id: null,
          totalChunks: { $sum: 1 },
          avgChunkSize: { $avg: { $binarySize: '$data' } }
        }
      }
    ]).toArray();

    return {
      overview: stats[0] || {
        totalFiles: 0,
        totalSize: 0,
        avgFileSize: 0
      },
      byCategory: categoryStats,
      byMimeType: mimeTypeStats.slice(0, 10), // Top 10 mime types
      chunkStatistics: chunkStats[0] || {},
      humanReadable: {
        totalSize: this.formatFileSize(stats[0]?.totalSize || 0),
        avgFileSize: this.formatFileSize(stats[0]?.avgFileSize || 0)
      }
    };
  }

  async verifyFileIntegrity(fileId) {
    // Verify file integrity by checking chunks
    const file = await this.filesCollection.findOne({ _id: ObjectId(fileId) });
    if (!file) return false;

    const expectedChunks = Math.ceil(file.length / file.chunkSize);
    const actualChunks = await this.chunksCollection.countDocuments({
      files_id: ObjectId(fileId)
    });

    return expectedChunks === actualChunks;
  }

  formatFileSize(bytes) {
    // Human-readable file size formatting
    if (bytes === 0) return '0 B';

    const units = ['B', 'KB', 'MB', 'GB', 'TB'];
    const base = 1024;
    const unitIndex = Math.floor(Math.log(bytes) / Math.log(base));
    const size = bytes / Math.pow(base, unitIndex);

    return `${size.toFixed(2)} ${units[unitIndex]}`;
  }

  calculateChecksum(filePath) {
    // Calculate MD5 checksum for file integrity
    const crypto = require('crypto');
    const fs = require('fs');
    const hash = crypto.createHash('md5');
    const data = fs.readFileSync(filePath);
    return hash.update(data).digest('hex');
  }

  analyzeFileContent(filePath, mimeType) {
    // Basic file content analysis
    const fs = require('fs');
    const stats = fs.statSync(filePath);

    const analysis = {
      isExecutable: stats.mode & parseInt('111', 8),
      lastModified: stats.mtime,
      createdAt: stats.birthtime,
      fileType: this.getFileTypeFromMime(mimeType)
    };

    // Additional analysis based on file type
    if (mimeType && mimeType.startsWith('image/')) {
      analysis.category = 'image';
      // Could add image dimension analysis here
    } else if (mimeType && mimeType.startsWith('video/')) {
      analysis.category = 'video';
      // Could add video metadata extraction here
    } else if (mimeType && mimeType.includes('pdf')) {
      analysis.category = 'document';
      // Could add PDF metadata extraction here
    }

    return analysis;
  }

  getFileTypeFromMime(mimeType) {
    if (!mimeType) return 'unknown';

    const typeMap = {
      'application/pdf': 'pdf',
      'application/msword': 'word',
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'word',
      'application/vnd.ms-excel': 'excel',
      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': 'excel',
      'text/plain': 'text',
      'text/csv': 'csv',
      'application/json': 'json',
      'application/zip': 'archive',
      'application/x-tar': 'archive'
    };

    if (mimeType.startsWith('image/')) return 'image';
    if (mimeType.startsWith('video/')) return 'video';
    if (mimeType.startsWith('audio/')) return 'audio';

    return typeMap[mimeType] || 'other';
  }
}

Advanced File Processing and Streaming

Implement sophisticated file processing capabilities:

// Advanced file processing and streaming operations
class GridFSProcessingService {
  constructor(db, fileManager) {
    this.db = db;
    this.fileManager = fileManager;
    this.processingQueue = db.collection('file_processing_queue');
  }

  async processImageFile(fileId, operations) {
    // Process image files with transformations
    const Sharp = require('sharp'); // Image processing library
    const originalBuffer = await this.fileManager.getFileBuffer(fileId);
    const originalMeta = await this.fileManager.getFileMetadata(fileId);

    const processedVersions = [];

    for (const operation of operations) {
      let processedBuffer;
      let newFilename;

      switch (operation.type) {
        case 'resize':
          processedBuffer = await Sharp(originalBuffer)
            .resize(operation.width, operation.height, {
              fit: operation.fit || 'cover',
              withoutEnlargement: true
            })
            .toBuffer();
          newFilename = `${originalMeta.filename}_${operation.width}x${operation.height}`;
          break;

        case 'thumbnail':
          processedBuffer = await Sharp(originalBuffer)
            .resize(150, 150, { fit: 'cover' })
            .jpeg({ quality: 80 })
            .toBuffer();
          newFilename = `${originalMeta.filename}_thumbnail`;
          break;

        case 'watermark':
          const watermark = await Sharp(operation.watermarkPath)
            .resize(Math.floor(operation.width * 0.3))
            .png()
            .toBuffer();

          processedBuffer = await Sharp(originalBuffer)
            .composite([{
              input: watermark,
              gravity: operation.position || 'southeast'
            }])
            .toBuffer();
          newFilename = `${originalMeta.filename}_watermarked`;
          break;

        case 'format_conversion':
          const sharpInstance = Sharp(originalBuffer);

          switch (operation.format) {
            case 'jpeg':
              processedBuffer = await sharpInstance.jpeg({ quality: operation.quality || 85 }).toBuffer();
              break;
            case 'png':
              processedBuffer = await sharpInstance.png({ compressionLevel: operation.compression || 6 }).toBuffer();
              break;
            case 'webp':
              processedBuffer = await sharpInstance.webp({ quality: operation.quality || 80 }).toBuffer();
              break;
          }
          newFilename = `${originalMeta.filename}.${operation.format}`;
          break;
      }

      // Upload processed version
      const processedFile = await this.fileManager.uploadFromBuffer(
        processedBuffer,
        newFilename,
        {
          ...originalMeta.metadata,
          processedFrom: originalMeta.fileId,
          processingOperation: operation,
          processedAt: new Date(),
          category: 'processed-image',
          originalFileId: originalMeta.fileId
        }
      );

      processedVersions.push(processedFile);
    }

    // Update original file metadata with processing info
    await this.fileManager.updateFileMetadata(fileId, {
      processedVersions: processedVersions.map(v => v.fileId),
      processingComplete: true,
      processedAt: new Date()
    });

    return processedVersions;
  }

  async extractDocumentText(fileId) {
    // Extract text content from documents for search indexing
    const fileBuffer = await this.fileManager.getFileBuffer(fileId);
    const metadata = await this.fileManager.getFileMetadata(fileId);
    const mimeType = metadata.metadata?.mimeType;

    let extractedText = '';

    try {
      switch (mimeType) {
        case 'application/pdf':
          // PDF text extraction
          const pdfParse = require('pdf-parse');
          const pdfData = await pdfParse(fileBuffer);
          extractedText = pdfData.text;
          break;

        case 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
          // Word document text extraction
          const mammoth = require('mammoth');
          const wordResult = await mammoth.extractRawText({ buffer: fileBuffer });
          extractedText = wordResult.value;
          break;

        case 'text/plain':
        case 'text/csv':
        case 'application/json':
          // Plain text files
          extractedText = fileBuffer.toString('utf8');
          break;

        default:
          console.log(`Text extraction not supported for ${mimeType}`);
          return null;
      }

      // Store extracted text for search
      await this.fileManager.updateFileMetadata(fileId, {
        extractedText: extractedText.substring(0, 10000), // Limit stored text
        textExtracted: true,
        textExtractionDate: new Date(),
        wordCount: extractedText.split(/\s+/).length,
        characterCount: extractedText.length
      });

      // Create text search index entry
      await this.db.collection('file_text_index').insertOne({
        fileId: ObjectId(fileId),
        filename: metadata.filename,
        extractedText: extractedText,
        extractedAt: new Date(),
        metadata: metadata.metadata
      });

      return {
        fileId: fileId,
        extractedText: extractedText,
        wordCount: extractedText.split(/\s+/).length,
        characterCount: extractedText.length
      };

    } catch (error) {
      console.error(`Text extraction failed for ${fileId}:`, error);

      await this.fileManager.updateFileMetadata(fileId, {
        textExtractionFailed: true,
        textExtractionError: error.message,
        textExtractionAttempted: new Date()
      });

      return null;
    }
  }

  async createFileArchive(fileIds, archiveName) {
    // Create ZIP archive containing multiple files
    const archiver = require('archiver');
    const { PassThrough } = require('stream');

    const archive = archiver('zip', { zlib: { level: 9 } });
    const bufferStream = new PassThrough();
    const chunks = [];

    bufferStream.on('data', (chunk) => chunks.push(chunk));

    return new Promise(async (resolve, reject) => {
      bufferStream.on('end', async () => {
        const archiveBuffer = Buffer.concat(chunks);

        // Upload archive to GridFS
        const archiveFile = await this.fileManager.uploadFromBuffer(
          archiveBuffer,
          `${archiveName}.zip`,
          {
            category: 'archive',
            archiveType: 'zip',
            containedFiles: fileIds,
            createdAt: new Date(),
            fileCount: fileIds.length,
            mimeType: 'application/zip'
          }
        );

        resolve(archiveFile);
      });

      archive.on('error', reject);
      archive.pipe(bufferStream);

      // Add files to archive
      for (const fileId of fileIds) {
        const metadata = await this.fileManager.getFileMetadata(fileId);
        const fileBuffer = await this.fileManager.getFileBuffer(fileId);

        archive.append(fileBuffer, { name: metadata.filename });
      }

      archive.finalize();
    });
  }

  async streamFileRange(fileId, range) {
    // Stream partial file content (useful for video streaming, resume downloads)
    const file = await this.fileManager.getFileMetadata(fileId);
    if (!file) {
      throw new Error(`File ${fileId} not found`);
    }

    const { start = 0, end = file.length - 1 } = range;
    const chunkSize = file.chunkSize;

    const startChunk = Math.floor(start / chunkSize);
    const endChunk = Math.floor(end / chunkSize);

    // Get relevant chunks
    const chunks = await this.fileManager.chunksCollection
      .find({
        files_id: ObjectId(fileId),
        n: { $gte: startChunk, $lte: endChunk }
      })
      .sort({ n: 1 })
      .toArray();

    const { Readable } = require('stream');
    const rangeStream = new Readable({
      read() {}
    });

    // Process chunks and extract requested range
    let currentPosition = startChunk * chunkSize;

    chunks.forEach((chunk, index) => {
      const chunkData = chunk.data.buffer;

      let chunkStart = 0;
      let chunkEnd = chunkData.length;

      // Adjust for first chunk
      if (index === 0 && start > currentPosition) {
        chunkStart = start - currentPosition;
      }

      // Adjust for last chunk
      if (index === chunks.length - 1 && end < currentPosition + chunkData.length) {
        chunkEnd = end - currentPosition + 1;
      }

      if (chunkStart < chunkEnd) {
        rangeStream.push(chunkData.slice(chunkStart, chunkEnd));
      }

      currentPosition += chunkData.length;
    });

    rangeStream.push(null); // End stream

    return {
      stream: rangeStream,
      contentLength: end - start + 1,
      contentRange: `bytes ${start}-${end}/${file.length}`
    };
  }

  async scheduleFileProcessing(fileId, processingType, options = {}) {
    // Queue file for background processing
    const processingJob = {
      fileId: ObjectId(fileId),
      processingType: processingType,
      options: options,
      status: 'queued',
      createdAt: new Date(),
      attempts: 0,
      maxAttempts: options.maxAttempts || 3
    };

    await this.processingQueue.insertOne(processingJob);

    // Trigger immediate processing if requested
    if (options.immediate) {
      return await this.processQueuedJob(processingJob._id);
    }

    return processingJob;
  }

  async processQueuedJob(jobId) {
    // Process queued file processing job
    const job = await this.processingQueue.findOne({ _id: ObjectId(jobId) });
    if (!job) {
      throw new Error(`Processing job ${jobId} not found`);
    }

    try {
      // Update job status
      await this.processingQueue.updateOne(
        { _id: job._id },
        { 
          $set: { 
            status: 'processing', 
            startedAt: new Date() 
          },
          $inc: { attempts: 1 }
        }
      );

      let result;

      switch (job.processingType) {
        case 'image_processing':
          result = await this.processImageFile(job.fileId, job.options.operations);
          break;

        case 'text_extraction':
          result = await this.extractDocumentText(job.fileId);
          break;

        case 'thumbnail_generation':
          result = await this.generateThumbnail(job.fileId, job.options);
          break;

        default:
          throw new Error(`Unknown processing type: ${job.processingType}`);
      }

      // Mark job as completed
      await this.processingQueue.updateOne(
        { _id: job._id },
        { 
          $set: { 
            status: 'completed',
            completedAt: new Date(),
            result: result
          }
        }
      );

      return result;

    } catch (error) {
      // Handle job failure
      const shouldRetry = job.attempts < job.maxAttempts;

      await this.processingQueue.updateOne(
        { _id: job._id },
        {
          $set: {
            status: shouldRetry ? 'failed_retryable' : 'failed',
            lastError: error.message,
            lastAttemptAt: new Date()
          }
        }
      );

      if (!shouldRetry) {
        console.error(`Processing job ${jobId} failed permanently:`, error);
      }

      throw error;
    }
  }

  async generateThumbnail(fileId, options = {}) {
    // Generate thumbnail for various file types
    const metadata = await this.fileManager.getFileMetadata(fileId);
    const mimeType = metadata.metadata?.mimeType;
    const { width = 150, height = 150, quality = 80 } = options;

    if (mimeType && mimeType.startsWith('image/')) {
      // Image thumbnail
      return await this.processImageFile(fileId, [{
        type: 'thumbnail',
        width: width,
        height: height,
        quality: quality
      }]);
    } else if (mimeType === 'application/pdf') {
      // PDF thumbnail (first page)
      const pdf2pic = require('pdf2pic');
      const fileBuffer = await this.fileManager.getFileBuffer(fileId);

      const convert = pdf2pic.fromBuffer(fileBuffer, {
        density: 100,
        saveFilename: "page",
        savePath: "/tmp",
        format: "png",
        width: width,
        height: height
      });

      const result = await convert(1); // First page
      const thumbnailBuffer = require('fs').readFileSync(result.path);

      return await this.fileManager.uploadFromBuffer(
        thumbnailBuffer,
        `${metadata.filename}_thumbnail.png`,
        {
          category: 'thumbnail',
          thumbnailOf: fileId,
          generatedAt: new Date(),
          mimeType: 'image/png'
        }
      );
    }

    throw new Error(`Thumbnail generation not supported for ${mimeType}`);
  }
}

File Security and Access Control

Implement comprehensive file security:

// File security and access control system
class GridFSSecurityManager {
  constructor(db, fileManager) {
    this.db = db;
    this.fileManager = fileManager;
    this.accessLog = db.collection('file_access_log');
    this.permissions = db.collection('file_permissions');
  }

  async setFilePermissions(fileId, permissions) {
    // Set granular file permissions
    const permissionDoc = {
      fileId: ObjectId(fileId),
      permissions: {
        read: permissions.read || [],      // User/role IDs who can read
        write: permissions.write || [],    // User/role IDs who can modify
        delete: permissions.delete || [], // User/role IDs who can delete
        admin: permissions.admin || []     // User/role IDs who can change permissions
      },
      inheritance: permissions.inheritance || 'none', // none, folder, parent
      publicAccess: permissions.publicAccess || false,
      expiresAt: permissions.expiresAt || null,
      createdAt: new Date(),
      createdBy: permissions.createdBy
    };

    await this.permissions.replaceOne(
      { fileId: ObjectId(fileId) },
      permissionDoc,
      { upsert: true }
    );

    // Update file metadata
    await this.fileManager.updateFileMetadata(fileId, {
      hasCustomPermissions: true,
      lastPermissionUpdate: new Date()
    });

    return permissionDoc;
  }

  async checkFileAccess(fileId, userId, operation = 'read') {
    // Check if user has access to perform operation on file
    const filePerms = await this.permissions.findOne({
      fileId: ObjectId(fileId)
    });

    // Log access attempt
    await this.logAccess(fileId, userId, operation, filePerms ? 'authorized' : 'checking');

    if (!filePerms) {
      // No specific permissions - check default policy
      return await this.checkDefaultAccess(fileId, userId, operation);
    }

    // Check expiration
    if (filePerms.expiresAt && new Date() > filePerms.expiresAt) {
      await this.logAccess(fileId, userId, operation, 'expired');
      return { allowed: false, reason: 'permissions_expired' };
    }

    // Check public access
    if (filePerms.publicAccess && operation === 'read') {
      await this.logAccess(fileId, userId, operation, 'public_access');
      return { allowed: true, reason: 'public_access' };
    }

    // Check specific permissions
    const userRoles = await this.getUserRoles(userId);
    const allowedEntities = filePerms.permissions[operation] || [];

    const hasAccess = allowedEntities.some(entity => 
      entity.toString() === userId.toString() || userRoles.includes(entity.toString())
    );

    const result = { 
      allowed: hasAccess, 
      reason: hasAccess ? 'explicit_permission' : 'permission_denied',
      permissions: filePerms.permissions
    };

    await this.logAccess(fileId, userId, operation, hasAccess ? 'granted' : 'denied');
    return result;
  }

  async createSecureFileShare(fileId, shareConfig) {
    // Create secure, time-limited file share
    const shareToken = this.generateSecureToken();
    const shareDoc = {
      fileId: ObjectId(fileId),
      shareToken: shareToken,
      sharedBy: shareConfig.sharedBy,
      sharedAt: new Date(),
      expiresAt: shareConfig.expiresAt || new Date(Date.now() + 7 * 24 * 60 * 60 * 1000), // 7 days

      // Access restrictions
      maxDownloads: shareConfig.maxDownloads || null,
      currentDownloads: 0,
      allowedIPs: shareConfig.allowedIPs || [],
      requirePassword: shareConfig.password ? true : false,
      passwordHash: shareConfig.password ? this.hashPassword(shareConfig.password) : null,

      // Permissions
      allowDownload: shareConfig.allowDownload !== false,
      allowView: shareConfig.allowView !== false,
      allowPreview: shareConfig.allowPreview !== false,

      // Tracking
      accessLog: [],
      isActive: true
    };

    await this.db.collection('file_shares').insertOne(shareDoc);

    // Generate secure share URL
    const shareUrl = `${process.env.BASE_URL}/shared/${shareToken}`;

    return {
      shareToken: shareToken,
      shareUrl: shareUrl,
      expiresAt: shareDoc.expiresAt,
      shareId: shareDoc._id
    };
  }

  async accessSharedFile(shareToken, clientIP, password = null) {
    // Access file through secure share
    const share = await this.db.collection('file_shares').findOne({
      shareToken: shareToken,
      isActive: true,
      expiresAt: { $gt: new Date() }
    });

    if (!share) {
      return { success: false, error: 'share_not_found_or_expired' };
    }

    // Check download limit
    if (share.maxDownloads && share.currentDownloads >= share.maxDownloads) {
      return { success: false, error: 'download_limit_exceeded' };
    }

    // Check IP restrictions
    if (share.allowedIPs.length > 0 && !share.allowedIPs.includes(clientIP)) {
      return { success: false, error: 'ip_not_allowed' };
    }

    // Check password
    if (share.requirePassword) {
      if (!password || !this.verifyPassword(password, share.passwordHash)) {
        return { success: false, error: 'invalid_password' };
      }
    }

    // Update access tracking
    await this.db.collection('file_shares').updateOne(
      { _id: share._id },
      {
        $inc: { currentDownloads: 1 },
        $push: {
          accessLog: {
            accessedAt: new Date(),
            clientIP: clientIP,
            userAgent: 'unknown' // Could be passed as parameter
          }
        }
      }
    );

    return {
      success: true,
      fileId: share.fileId,
      permissions: {
        allowDownload: share.allowDownload,
        allowView: share.allowView,
        allowPreview: share.allowPreview
      }
    };
  }

  async encryptFile(fileId, encryptionKey) {
    // Encrypt file content (in-place)
    const crypto = require('crypto');
    const originalBuffer = await this.fileManager.getFileBuffer(fileId);
    const metadata = await this.fileManager.getFileMetadata(fileId);

    // Generate encryption parameters
    const iv = crypto.randomBytes(16);
    const cipher = crypto.createCipher('aes-256-gcm', encryptionKey);

    const encryptedBuffer = Buffer.concat([
      cipher.update(originalBuffer),
      cipher.final()
    ]);

    const authTag = cipher.getAuthTag();

    // Upload encrypted version
    const encryptedFile = await this.fileManager.uploadFromBuffer(
      encryptedBuffer,
      `${metadata.filename}.encrypted`,
      {
        ...metadata.metadata,
        encrypted: true,
        encryptionAlgorithm: 'aes-256-gcm',
        encryptionIV: iv.toString('hex'),
        authTag: authTag.toString('hex'),
        originalFileId: fileId,
        encryptedAt: new Date()
      }
    );

    // Delete original file if requested
    // await this.fileManager.deleteFile(fileId);

    return {
      encryptedFileId: encryptedFile.fileId,
      encryptionInfo: {
        algorithm: 'aes-256-gcm',
        iv: iv.toString('hex'),
        authTag: authTag.toString('hex')
      }
    };
  }

  async decryptFile(encryptedFileId, encryptionKey) {
    // Decrypt encrypted file
    const crypto = require('crypto');
    const encryptedBuffer = await this.fileManager.getFileBuffer(encryptedFileId);
    const metadata = await this.fileManager.getFileMetadata(encryptedFileId);

    if (!metadata.metadata.encrypted) {
      throw new Error('File is not encrypted');
    }

    const iv = Buffer.from(metadata.metadata.encryptionIV, 'hex');
    const authTag = Buffer.from(metadata.metadata.authTag, 'hex');

    const decipher = crypto.createDecipher('aes-256-gcm', encryptionKey);
    decipher.setAuthTag(authTag);

    try {
      const decryptedBuffer = Buffer.concat([
        decipher.update(encryptedBuffer),
        decipher.final()
      ]);

      return decryptedBuffer;
    } catch (error) {
      throw new Error('Decryption failed - invalid key or corrupted file');
    }
  }

  async logAccess(fileId, userId, operation, status) {
    // Log file access for audit trail
    await this.accessLog.insertOne({
      fileId: ObjectId(fileId),
      userId: userId ? ObjectId(userId) : null,
      operation: operation,
      status: status,
      timestamp: new Date(),
      ipAddress: 'unknown', // Could be passed as parameter
      userAgent: 'unknown'  // Could be passed as parameter
    });
  }

  async getFileAccessLog(fileId, options = {}) {
    // Get access log for specific file
    const query = { fileId: ObjectId(fileId) };

    if (options.dateRange) {
      query.timestamp = {
        $gte: new Date(options.dateRange.start),
        $lte: new Date(options.dateRange.end)
      };
    }

    const accessEntries = await this.accessLog
      .find(query)
      .sort({ timestamp: -1 })
      .limit(options.limit || 100)
      .toArray();

    return accessEntries;
  }

  async getUserRoles(userId) {
    // Get user roles for permission checking
    // This would typically integrate with your user management system
    const user = await this.db.collection('users').findOne({
      _id: ObjectId(userId)
    });

    return user?.roles || [];
  }

  generateSecureToken() {
    // Generate cryptographically secure random token
    const crypto = require('crypto');
    return crypto.randomBytes(32).toString('hex');
  }

  hashPassword(password) {
    // Hash password for secure storage
    const crypto = require('crypto');
    const salt = crypto.randomBytes(16).toString('hex');
    const hash = crypto.pbkdf2Sync(password, salt, 10000, 64, 'sha512').toString('hex');
    return `${salt}:${hash}`;
  }

  verifyPassword(password, hash) {
    // Verify password against hash
    const crypto = require('crypto');
    const [salt, originalHash] = hash.split(':');
    const verifyHash = crypto.pbkdf2Sync(password, salt, 10000, 64, 'sha512').toString('hex');
    return originalHash === verifyHash;
  }

  async checkDefaultAccess(fileId, userId, operation) {
    // Default access policy when no specific permissions set
    // This would be customized based on your application's security model
    return { 
      allowed: operation === 'read', // Default: allow read, deny write/delete
      reason: 'default_policy'
    };
  }
}

SQL-Style File Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB GridFS operations:

-- QueryLeaf GridFS operations with SQL-familiar syntax

-- Upload file with metadata (SQL-style INSERT)
INSERT INTO FILES (filename, content, metadata)
VALUES (
  'annual-report-2024.pdf',
  LOAD_FILE('/path/to/annual-report-2024.pdf'),
  JSON_BUILD_OBJECT(
    'title', 'Annual Report 2024',
    'category', 'financial-reports',
    'department', 'finance',
    'confidentiality', 'internal',
    'tags', ARRAY['annual', 'report', '2024', 'finance'],
    'uploadedBy', '64f1a2c4567890abcdef1234',
    'uploadedByName', 'John Smith',
    'approvalStatus', 'pending'
  )
);

-- Query files with metadata filtering (SQL-style SELECT)
SELECT 
  file_id,
  filename,
  length as file_size,
  FORMAT_BYTES(length) as human_size,
  upload_date,
  md5_hash,
  metadata->>'title' as document_title,
  metadata->>'category' as category,
  metadata->>'department' as department,
  metadata->'tags' as tags
FROM FILES
WHERE metadata->>'category' = 'financial-reports'
  AND metadata->>'department' = 'finance'
  AND upload_date >= CURRENT_DATE - INTERVAL '1 year'
  AND length > 1024 * 1024  -- Files larger than 1MB
ORDER BY upload_date DESC;

-- Search files by content and metadata
SELECT 
  f.file_id,
  f.filename,
  f.length,
  f.metadata->>'title' as title,
  f.metadata->>'category' as category,
  CASE 
    WHEN f.metadata->>'confidentiality' = 'public' THEN 'Public'
    WHEN f.metadata->>'confidentiality' = 'internal' THEN 'Internal'
    WHEN f.metadata->>'confidentiality' = 'confidential' THEN 'Confidential'
    ELSE 'Unknown'
  END as access_level
FROM FILES f
WHERE (f.filename ILIKE '%report%' 
   OR f.metadata->>'title' ILIKE '%financial%'
   OR f.metadata->'tags' @> '["quarterly"]')
  AND f.metadata->>'approvalStatus' = 'approved'
ORDER BY f.upload_date DESC
LIMIT 50;

-- File operations with streaming and processing
-- Download file content
SELECT 
  file_id,
  filename,
  DOWNLOAD_FILE(file_id) as file_content,
  metadata
FROM FILES
WHERE file_id = '64f1a2c4567890abcdef1234';

-- Stream file content in chunks (for large files)
SELECT 
  file_id,
  filename,
  STREAM_FILE_RANGE(file_id, 0, 1048576) as chunk_content,  -- First 1MB
  FORMAT_BYTES(length) as total_size
FROM FILES
WHERE file_id = '64f1a2c4567890abcdef1234';

-- File processing operations
-- Generate thumbnail for image file
INSERT INTO FILES (filename, content, metadata)
SELECT 
  CONCAT(REPLACE(filename, '.', '_thumbnail.'), 'jpg'),
  GENERATE_THUMBNAIL(file_id, 150, 150) as thumbnail_content,
  JSON_BUILD_OBJECT(
    'category', 'thumbnail',
    'thumbnailOf', file_id,
    'generatedAt', CURRENT_TIMESTAMP,
    'dimensions', JSON_BUILD_OBJECT('width', 150, 'height', 150)
  )
FROM FILES
WHERE metadata->>'category' = 'image'
  AND file_id = '64f1a2c4567890abcdef1234';

-- Extract text content from documents
UPDATE FILES
SET metadata = metadata || JSON_BUILD_OBJECT(
  'extractedText', EXTRACT_TEXT(file_id),
  'textExtracted', true,
  'textExtractionDate', CURRENT_TIMESTAMP,
  'wordCount', WORD_COUNT(EXTRACT_TEXT(file_id))
)
WHERE metadata->>'mimeType' IN ('application/pdf', 'application/msword')
  AND (metadata->>'textExtracted')::boolean IS NOT TRUE;

-- File analytics and statistics
SELECT 
  metadata->>'category' as file_category,
  COUNT(*) as file_count,
  SUM(length) as total_bytes,
  FORMAT_BYTES(SUM(length)) as total_size,
  AVG(length) as avg_file_size,
  FORMAT_BYTES(AVG(length)) as avg_human_size,
  MIN(upload_date) as oldest_file,
  MAX(upload_date) as newest_file
FROM FILES
WHERE upload_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY metadata->>'category'
ORDER BY total_bytes DESC;

-- File access permissions and security
-- Set file permissions
UPDATE FILES
SET metadata = metadata || JSON_BUILD_OBJECT(
  'permissions', JSON_BUILD_OBJECT(
    'read', ARRAY['finance', 'management', 'audit'],
    'write', ARRAY['finance'],
    'admin', ARRAY['finance-manager']
  ),
  'hasCustomPermissions', true,
  'lastPermissionUpdate', CURRENT_TIMESTAMP
)
WHERE file_id = '64f1a2c4567890abcdef1234';

-- Check user access to files
SELECT 
  f.file_id,
  f.filename,
  f.metadata->>'title' as title,
  CHECK_FILE_ACCESS(f.file_id, '64f1a2c4567890abcdef9999', 'read') as can_read,
  CHECK_FILE_ACCESS(f.file_id, '64f1a2c4567890abcdef9999', 'write') as can_write,
  CHECK_FILE_ACCESS(f.file_id, '64f1a2c4567890abcdef9999', 'delete') as can_delete
FROM FILES f
WHERE f.metadata->>'category' = 'financial-reports'
ORDER BY f.upload_date DESC;

-- Create secure file sharing links
INSERT INTO file_shares (
  file_id,
  share_token,
  shared_by,
  expires_at,
  max_downloads,
  allow_download,
  created_at
)
SELECT 
  file_id,
  GENERATE_SECURE_TOKEN() as share_token,
  '64f1a2c4567890abcdef1111' as shared_by,
  CURRENT_TIMESTAMP + INTERVAL '7 days' as expires_at,
  5 as max_downloads,
  true as allow_download,
  CURRENT_TIMESTAMP
FROM FILES
WHERE file_id = '64f1a2c4567890abcdef1234';

-- File versioning and history
-- Create file version
INSERT INTO FILES (filename, content, metadata)
SELECT 
  filename,
  content,
  metadata || JSON_BUILD_OBJECT(
    'version', COALESCE((metadata->>'version')::numeric, 0) + 1,
    'previousVersion', file_id,
    'versionCreated', CURRENT_TIMESTAMP,
    'versionCreatedBy', '64f1a2c4567890abcdef2222'
  )
FROM FILES
WHERE file_id = '64f1a2c4567890abcdef1234';

-- Get file version history
WITH file_versions AS (
  SELECT 
    file_id,
    filename,
    upload_date,
    metadata->>'version' as version,
    metadata->>'previousVersion' as previous_version,
    metadata->>'versionCreatedBy' as created_by,
    length
  FROM FILES
  WHERE filename = 'annual-report-2024.pdf'
    OR metadata->>'previousVersion' = '64f1a2c4567890abcdef1234'
)
SELECT 
  file_id,
  version,
  upload_date as version_date,
  created_by,
  FORMAT_BYTES(length) as file_size,
  LAG(version, 1) OVER (ORDER BY version) as previous_version
FROM file_versions
ORDER BY version DESC;

-- Bulk file operations
-- Archive old files by moving to archive category
UPDATE FILES
SET metadata = metadata || JSON_BUILD_OBJECT(
  'category', 'archived',
  'archivedAt', CURRENT_TIMESTAMP,
  'originalCategory', metadata->>'category'
)
WHERE upload_date < CURRENT_DATE - INTERVAL '2 years'
  AND metadata->>'category' NOT IN ('archived', 'permanent');

-- Create ZIP archive of related files
INSERT INTO FILES (filename, content, metadata)
SELECT 
  'financial-reports-2024.zip' as filename,
  CREATE_ZIP_ARCHIVE(ARRAY_AGG(file_id)) as content,
  JSON_BUILD_OBJECT(
    'category', 'archive',
    'archiveType', 'zip',
    'containedFiles', ARRAY_AGG(file_id),
    'fileCount', COUNT(*),
    'createdAt', CURRENT_TIMESTAMP
  ) as metadata
FROM FILES
WHERE metadata->>'category' = 'financial-reports'
  AND upload_date >= '2024-01-01'
  AND upload_date < '2025-01-01';

-- File duplicate detection and cleanup
WITH file_duplicates AS (
  SELECT 
    md5_hash,
    COUNT(*) as duplicate_count,
    ARRAY_AGG(file_id ORDER BY upload_date) as file_ids,
    ARRAY_AGG(filename ORDER BY upload_date) as filenames,
    MIN(upload_date) as first_uploaded,
    MAX(upload_date) as last_uploaded,
    SUM(length) as total_wasted_space
  FROM FILES
  GROUP BY md5_hash
  HAVING COUNT(*) > 1
)
SELECT 
  md5_hash,
  duplicate_count,
  filenames[1] as original_filename,
  first_uploaded,
  last_uploaded,
  FORMAT_BYTES(total_wasted_space) as wasted_space,
  -- Get file IDs to keep (first) and delete (rest)
  file_ids[1] as keep_file_id,
  file_ids[2:] as delete_file_ids
FROM file_duplicates
ORDER BY total_wasted_space DESC;

-- Storage optimization and maintenance
SELECT 
  -- Storage usage by category
  'storage_by_category' as metric_type,
  metadata->>'category' as category,
  COUNT(*) as file_count,
  SUM(length) as total_bytes,
  FORMAT_BYTES(SUM(length)) as total_size,
  ROUND((SUM(length)::float / (SELECT SUM(length) FROM FILES)) * 100, 2) as percentage_of_total
FROM FILES
GROUP BY metadata->>'category'

UNION ALL

SELECT 
  -- Large files analysis
  'large_files' as metric_type,
  'files_over_100mb' as category,
  COUNT(*) as file_count,
  SUM(length) as total_bytes,
  FORMAT_BYTES(SUM(length)) as total_size,
  ROUND((SUM(length)::float / (SELECT SUM(length) FROM FILES)) * 100, 2) as percentage_of_total
FROM FILES
WHERE length > 100 * 1024 * 1024

UNION ALL

SELECT 
  -- Old files analysis
  'old_files' as metric_type,
  'files_over_1_year_old' as category,
  COUNT(*) as file_count,
  SUM(length) as total_bytes,
  FORMAT_BYTES(SUM(length)) as total_size,
  ROUND((SUM(length)::float / (SELECT SUM(length) FROM FILES)) * 100, 2) as percentage_of_total
FROM FILES
WHERE upload_date < CURRENT_DATE - INTERVAL '1 year'

ORDER BY metric_type, total_bytes DESC;

-- QueryLeaf provides comprehensive GridFS functionality:
-- 1. SQL-familiar file upload and download operations
-- 2. Rich metadata querying and filtering capabilities
-- 3. File processing functions (thumbnails, text extraction)
-- 4. Access control and permission management
-- 5. File versioning and history tracking
-- 6. Bulk operations and archive creation
-- 7. Storage analytics and optimization tools
-- 8. Duplicate detection and cleanup operations
-- 9. Secure file sharing with expiration controls
-- 10. Integration with MongoDB's native GridFS capabilities

Best Practices for GridFS File Storage

Design Guidelines

Essential practices for effective GridFS usage:

Chunk Size Optimization: Choose appropriate chunk sizes based on file types and usage patterns
Metadata Design: Structure metadata for efficient querying and filtering
Index Strategy: Create proper indexes on metadata fields for fast file discovery
Security Implementation: Implement proper access controls and permission systems
Storage Management: Monitor storage usage and implement lifecycle policies
Performance Optimization: Use appropriate connection pooling and streaming techniques

Use Case Selection

Choose GridFS for appropriate scenarios:

Large File Storage: Files larger than 16MB that exceed BSON document limits
Media Management: Images, videos, audio files with rich metadata requirements
Document Management: PDF, Word, Excel files with content indexing needs
Backup Storage: Database backups, system archives with metadata tracking
User-Generated Content: Profile images, file uploads with permission controls
Binary Data Integration: Any binary data that benefits from database integration

Conclusion

MongoDB GridFS provides powerful, native file storage capabilities that eliminate the complexity of external file systems while delivering sophisticated metadata management, security controls, and processing capabilities. Combined with SQL-familiar file operations, GridFS enables comprehensive binary data solutions that integrate seamlessly with your document database.

Key GridFS benefits include:

Native Integration: Built-in file storage without external dependencies
Unlimited Size: Handle files of any size with automatic chunking and streaming
Rich Metadata: Comprehensive metadata storage with full query capabilities
Transactional Consistency: Atomic operations across file data and metadata
Built-in Replication: Automatic file replication with your database infrastructure

Whether you're building document management systems, media platforms, content management solutions, or applications requiring large binary data storage, MongoDB GridFS with QueryLeaf's familiar SQL interface provides the foundation for scalable file storage. This combination enables you to implement sophisticated file management capabilities while preserving familiar database interaction patterns.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB GridFS operations while providing SQL-familiar file upload, download, and query syntax. Complex file processing, metadata management, and security controls are seamlessly handled through familiar SQL patterns, making advanced file storage both powerful and accessible.

The integration of native file storage with SQL-style operations makes MongoDB an ideal platform for applications requiring both sophisticated file management and familiar database interaction patterns, ensuring your file storage solutions remain both effective and maintainable as they scale and evolve.

September 10, 2025
25 min read

MongoDB Capped Collections: High-Performance Circular Buffers with SQL-Style Fixed-Size Data Management

Modern applications generate massive amounts of streaming data - logs, events, metrics, chat messages, and real-time analytics data. Traditional database approaches struggle with the dual challenge of high-throughput write operations and automatic data lifecycle management. Storing unlimited streaming data leads to storage bloat, performance degradation, and complex data retention policies.

MongoDB capped collections provide a specialized solution for high-volume, time-ordered data by implementing fixed-size circular buffers at the database level. Unlike traditional tables that grow indefinitely, capped collections automatically maintain a fixed size by overwriting the oldest documents when capacity limits are reached, delivering predictable performance characteristics and eliminating the need for complex data purging mechanisms.

The High-Volume Data Challenge

Traditional approaches to streaming data storage have significant limitations:

-- Traditional SQL log table - grows indefinitely
CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    level VARCHAR(10) NOT NULL,
    service_name VARCHAR(50) NOT NULL,
    message TEXT NOT NULL,
    metadata JSONB,
    request_id UUID,
    user_id INTEGER,
    session_id VARCHAR(100),

    INDEX idx_timestamp (timestamp),
    INDEX idx_level (level),
    INDEX idx_service (service_name)
);

-- High-volume insertions
INSERT INTO application_logs (level, service_name, message, metadata, request_id, user_id)
VALUES 
    ('INFO', 'auth-service', 'User login successful', '{"ip": "192.168.1.100", "browser": "Chrome"}', uuid_generate_v4(), 12345),
    ('ERROR', 'payment-service', 'Payment processing failed', '{"amount": 99.99, "currency": "USD", "error_code": "CARD_DECLINED"}', uuid_generate_v4(), 67890),
    ('DEBUG', 'api-gateway', 'Request routed to microservice', '{"path": "/api/v1/users", "method": "GET", "response_time": 45}', uuid_generate_v4(), 11111);

-- Problems with unlimited growth:
-- 1. Table size grows indefinitely requiring manual cleanup
-- 2. Performance degrades as table size increases
-- 3. Index maintenance overhead scales with data volume
-- 4. Complex retention policies need external scheduling
-- 5. Storage costs increase without bounds
-- 6. Backup and maintenance times increase linearly

-- Manual cleanup required with complex scheduling
DELETE FROM application_logs 
WHERE timestamp < NOW() - INTERVAL '30 days';

-- Problems with manual cleanup:
-- - Requires scheduled maintenance scripts
-- - DELETE operations can cause table locks
-- - Index fragmentation after large deletions
-- - Uneven performance during cleanup windows
-- - Risk of accidentally deleting important data
-- - Complex retention rules difficult to implement

MongoDB capped collections solve these challenges automatically:

// MongoDB capped collection - automatic size management
// Create capped collection with automatic circular buffer behavior
db.createCollection("application_logs", {
  capped: true,
  size: 100 * 1024 * 1024, // 100MB maximum size
  max: 50000,              // Maximum 50,000 documents
  autoIndexId: false       // Optimize for insert performance
});

// High-performance insertions with guaranteed order preservation
db.application_logs.insertMany([
  {
    timestamp: new Date(),
    level: "INFO",
    serviceName: "auth-service",
    message: "User login successful",
    metadata: {
      ip: "192.168.1.100",
      browser: "Chrome",
      responseTime: 23
    },
    requestId: "req_001",
    userId: 12345,
    sessionId: "sess_abc123"
  },
  {
    timestamp: new Date(),
    level: "ERROR", 
    serviceName: "payment-service",
    message: "Payment processing failed",
    metadata: {
      amount: 99.99,
      currency: "USD",
      errorCode: "CARD_DECLINED",
      attemptNumber: 2
    },
    requestId: "req_002",
    userId: 67890
  },
  {
    timestamp: new Date(),
    level: "DEBUG",
    serviceName: "api-gateway", 
    message: "Request routed to microservice",
    metadata: {
      path: "/api/v1/users",
      method: "GET",
      responseTime: 45,
      upstreamService: "user-service"
    },
    requestId: "req_003",
    userId: 11111
  }
]);

// Benefits of capped collections:
// - Fixed size with automatic circular buffer behavior
// - Guaranteed insert order preservation (natural order)
// - High-performance insertions (no index maintenance overhead)
// - Automatic data lifecycle management (oldest data removed automatically)
// - Predictable performance characteristics regardless of data volume
// - No manual cleanup or maintenance required
// - Optimized for append-only workloads
// - Built-in tailable cursor support for real-time streaming

Understanding MongoDB Capped Collections

Capped Collection Fundamentals

Implement high-performance capped collections for various use cases:

// Comprehensive capped collection management system
class CappedCollectionManager {
  constructor(db) {
    this.db = db;
    this.collections = new Map();
    this.tailableCursors = new Map();
  }

  async createLogCollection(serviceName, options = {}) {
    // Create service-specific log collection
    const collectionName = `${serviceName}_logs`;
    const defaultOptions = {
      size: 50 * 1024 * 1024,  // 50MB default size
      max: 25000,              // 25K documents default
      autoIndexId: false       // Optimize for pure append workload
    };

    const cappedOptions = {
      capped: true,
      ...defaultOptions,
      ...options
    };

    try {
      // Create the capped collection
      await this.db.createCollection(collectionName, cappedOptions);

      // Store collection configuration
      this.collections.set(collectionName, {
        type: 'logs',
        service: serviceName,
        options: cappedOptions,
        createdAt: new Date(),
        totalInserts: 0,
        lastActivity: new Date()
      });

      console.log(`Created capped log collection: ${collectionName}`, cappedOptions);
      return this.db.collection(collectionName);

    } catch (error) {
      if (error.code === 48) { // Collection already exists
        console.log(`Capped collection ${collectionName} already exists`);
        return this.db.collection(collectionName);
      }
      throw error;
    }
  }

  async createMetricsCollection(metricType, options = {}) {
    // Create high-frequency metrics collection
    const collectionName = `metrics_${metricType}`;
    const metricsOptions = {
      capped: true,
      size: 200 * 1024 * 1024, // 200MB for metrics data
      max: 100000,             // 100K metric documents
      autoIndexId: false
    };

    const collection = await this.db.createCollection(collectionName, {
      ...metricsOptions,
      ...options
    });

    this.collections.set(collectionName, {
      type: 'metrics',
      metricType: metricType,
      options: metricsOptions,
      createdAt: new Date(),
      totalInserts: 0,
      lastActivity: new Date()
    });

    return collection;
  }

  async createEventStreamCollection(streamName, options = {}) {
    // Create event streaming collection for real-time processing
    const collectionName = `events_${streamName}`;
    const eventOptions = {
      capped: true,
      size: 100 * 1024 * 1024, // 100MB for event stream
      max: 50000,              // 50K events
      autoIndexId: false
    };

    const collection = await this.db.createCollection(collectionName, {
      ...eventOptions,
      ...options
    });

    this.collections.set(collectionName, {
      type: 'events',
      streamName: streamName,
      options: eventOptions,
      createdAt: new Date(),
      totalInserts: 0,
      lastActivity: new Date()
    });

    return collection;
  }

  async logMessage(serviceName, logData) {
    // High-performance logging with automatic batching
    const collectionName = `${serviceName}_logs`;
    let collection = this.db.collection(collectionName);

    // Create collection if it doesn't exist
    if (!this.collections.has(collectionName)) {
      collection = await this.createLogCollection(serviceName);
    }

    // Prepare log document with required fields
    const logDocument = {
      timestamp: logData.timestamp || new Date(),
      level: logData.level || 'INFO',
      serviceName: serviceName,
      message: logData.message,

      // Optional structured data
      metadata: logData.metadata || {},
      requestId: logData.requestId || null,
      userId: logData.userId || null,
      sessionId: logData.sessionId || null,
      traceId: logData.traceId || null,
      spanId: logData.spanId || null,

      // Performance tracking
      hostname: logData.hostname || require('os').hostname(),
      processId: process.pid,
      threadId: logData.threadId || 0,

      // Categorization
      category: logData.category || 'general',
      tags: logData.tags || []
    };

    // Insert with fire-and-forget for maximum performance
    await collection.insertOne(logDocument, { 
      writeConcern: { w: 0 } // Fire-and-forget for logs
    });

    // Update collection statistics
    const collectionInfo = this.collections.get(collectionName);
    if (collectionInfo) {
      collectionInfo.totalInserts++;
      collectionInfo.lastActivity = new Date();
    }

    return logDocument._id;
  }

  async writeMetrics(metricType, metricsData) {
    // High-throughput metrics writing
    const collectionName = `metrics_${metricType}`;
    let collection = this.db.collection(collectionName);

    if (!this.collections.has(collectionName)) {
      collection = await this.createMetricsCollection(metricType);
    }

    // Prepare metrics document
    const metricsDocument = {
      timestamp: metricsData.timestamp || new Date(),
      metricType: metricType,

      // Metric values
      values: metricsData.values || {},

      // Dimensions for grouping and filtering
      dimensions: {
        service: metricsData.service,
        environment: metricsData.environment || 'production',
        region: metricsData.region || 'us-east-1',
        version: metricsData.version || '1.0.0',
        ...metricsData.dimensions
      },

      // Aggregation-friendly structure
      counters: metricsData.counters || {},
      gauges: metricsData.gauges || {},
      histograms: metricsData.histograms || {},
      timers: metricsData.timers || {},

      // Source information
      source: {
        hostname: metricsData.hostname || require('os').hostname(),
        processId: process.pid,
        collectionId: metricsData.collectionId || null
      }
    };

    // Batch insertion for metrics (multiple metrics per call)
    if (Array.isArray(metricsData)) {
      const documents = metricsData.map(data => ({
        timestamp: data.timestamp || new Date(),
        metricType: metricType,
        values: data.values || {},
        dimensions: { ...data.dimensions },
        counters: data.counters || {},
        gauges: data.gauges || {},
        histograms: data.histograms || {},
        timers: data.timers || {},
        source: {
          hostname: data.hostname || require('os').hostname(),
          processId: process.pid,
          collectionId: data.collectionId || null
        }
      }));

      await collection.insertMany(documents, { 
        ordered: false, // Allow partial success
        writeConcern: { w: 0 }
      });

      return documents.length;
    } else {
      await collection.insertOne(metricsDocument, { 
        writeConcern: { w: 0 }
      });

      return 1;
    }
  }

  async publishEvent(streamName, eventData) {
    // Event streaming with guaranteed order preservation
    const collectionName = `events_${streamName}`;
    let collection = this.db.collection(collectionName);

    if (!this.collections.has(collectionName)) {
      collection = await this.createEventStreamCollection(streamName);
    }

    const eventDocument = {
      timestamp: eventData.timestamp || new Date(),
      eventId: eventData.eventId || new ObjectId(),
      eventType: eventData.eventType,
      streamName: streamName,

      // Event payload
      data: eventData.data || {},

      // Event metadata
      metadata: {
        version: eventData.version || '1.0',
        source: eventData.source || 'unknown',
        correlationId: eventData.correlationId || null,
        causationId: eventData.causationId || null,
        userId: eventData.userId || null,
        sessionId: eventData.sessionId || null,
        ...eventData.metadata
      },

      // Event context
      context: {
        service: eventData.service || 'unknown',
        environment: eventData.environment || 'production',
        hostname: require('os').hostname(),
        processId: process.pid,
        requestId: eventData.requestId || null
      }
    };

    // Events may need acknowledgment
    const result = await collection.insertOne(eventDocument, {
      writeConcern: { w: 1, j: true } // Ensure durability for events
    });

    return {
      eventId: eventDocument.eventId,
      insertedId: result.insertedId,
      timestamp: eventDocument.timestamp
    };
  }

  async queryRecentLogs(serviceName, options = {}) {
    // Query recent logs with natural ordering (insertion order)
    const collectionName = `${serviceName}_logs`;
    const collection = this.db.collection(collectionName);

    const query = {};

    // Add filters
    if (options.level) {
      query.level = options.level;
    }

    if (options.since) {
      query.timestamp = { $gte: options.since };
    }

    if (options.userId) {
      query.userId = options.userId;
    }

    if (options.category) {
      query.category = options.category;
    }

    // Use natural ordering for efficiency (no index needed)
    const cursor = collection.find(query);

    if (options.reverse) {
      // Get most recent first (reverse natural order)
      cursor.sort({ $natural: -1 });
    }

    if (options.limit) {
      cursor.limit(options.limit);
    }

    const logs = await cursor.toArray();

    return {
      logs: logs,
      count: logs.length,
      service: serviceName,
      query: query,
      options: options
    };
  }

  async getMetricsAggregation(metricType, timeRange, aggregationType = 'avg') {
    // Efficient metrics aggregation over time ranges
    const collectionName = `metrics_${metricType}`;
    const collection = this.db.collection(collectionName);

    const pipeline = [
      {
        $match: {
          timestamp: {
            $gte: timeRange.start,
            $lte: timeRange.end
          }
        }
      },
      {
        $group: {
          _id: {
            service: '$dimensions.service',
            environment: '$dimensions.environment',
            // Group by time bucket for time-series analysis
            timeBucket: {
              $dateTrunc: {
                date: '$timestamp',
                unit: timeRange.bucketSize || 'minute',
                binSize: timeRange.binSize || 1
              }
            }
          },

          // Aggregate different metric types
          avgValues: { $avg: '$values' },
          maxValues: { $max: '$values' },
          minValues: { $min: '$values' },
          sumCounters: { $sum: '$counters' },

          count: { $sum: 1 },

          firstTimestamp: { $min: '$timestamp' },
          lastTimestamp: { $max: '$timestamp' }
        }
      },
      {
        $sort: {
          '_id.timeBucket': 1,
          '_id.service': 1
        }
      },
      {
        $project: {
          service: '$_id.service',
          environment: '$_id.environment',
          timeBucket: '$_id.timeBucket',

          aggregatedValue: {
            $switch: {
              branches: [
                { case: { $eq: [aggregationType, 'avg'] }, then: '$avgValues' },
                { case: { $eq: [aggregationType, 'max'] }, then: '$maxValues' },
                { case: { $eq: [aggregationType, 'min'] }, then: '$minValues' },
                { case: { $eq: [aggregationType, 'sum'] }, then: '$sumCounters' }
              ],
              default: '$avgValues'
            }
          },

          dataPoints: '$count',
          timeRange: {
            start: '$firstTimestamp',
            end: '$lastTimestamp'
          },

          _id: 0
        }
      }
    ];

    const results = await collection.aggregate(pipeline).toArray();

    return {
      metricType: metricType,
      aggregationType: aggregationType,
      timeRange: timeRange,
      results: results,
      totalDataPoints: results.reduce((sum, r) => sum + r.dataPoints, 0)
    };
  }

  async createTailableCursor(collectionName, options = {}) {
    // Create tailable cursor for real-time streaming
    const collection = this.db.collection(collectionName);

    // Verify collection is capped
    const collectionInfo = await this.db.command({
      collStats: collectionName
    });

    if (!collectionInfo.capped) {
      throw new Error(`Collection ${collectionName} is not capped - tailable cursors require capped collections`);
    }

    const query = options.filter || {};
    const cursorOptions = {
      tailable: true,      // Don't close cursor when reaching end
      awaitData: true,     // Block briefly when no data available
      noCursorTimeout: true, // Don't timeout cursor
      maxTimeMS: options.maxTimeMS || 1000,
      batchSize: options.batchSize || 100
    };

    const cursor = collection.find(query, cursorOptions);

    // Store cursor reference for management
    const cursorId = `${collectionName}_${Date.now()}`;
    this.tailableCursors.set(cursorId, {
      cursor: cursor,
      collection: collectionName,
      filter: query,
      createdAt: new Date(),
      lastActivity: new Date()
    });

    return {
      cursorId: cursorId,
      cursor: cursor
    };
  }

  async streamData(collectionName, callback, options = {}) {
    // High-level streaming interface with automatic reconnection
    const { cursor, cursorId } = await this.createTailableCursor(collectionName, options);

    console.log(`Starting data stream from ${collectionName}`);

    try {
      while (await cursor.hasNext()) {
        const document = await cursor.next();

        if (document) {
          // Update last activity
          const cursorInfo = this.tailableCursors.get(cursorId);
          if (cursorInfo) {
            cursorInfo.lastActivity = new Date();
          }

          // Process document through callback
          try {
            await callback(document, { 
              collection: collectionName,
              cursorId: cursorId 
            });
          } catch (callbackError) {
            console.error('Stream callback error:', callbackError);
            // Continue streaming despite callback errors
          }
        }
      }
    } catch (streamError) {
      console.error(`Stream error for ${collectionName}:`, streamError);

      // Cleanup cursor reference
      this.tailableCursors.delete(cursorId);

      // Auto-reconnect for network errors
      if (streamError.name === 'MongoNetworkError' && options.autoReconnect !== false) {
        console.log(`Attempting to reconnect stream for ${collectionName}...`);
        setTimeout(() => {
          this.streamData(collectionName, callback, options);
        }, options.reconnectDelay || 5000);
      }

      throw streamError;
    }
  }

  async getCappedCollectionStats(collectionName) {
    // Get comprehensive statistics for capped collection
    const stats = await this.db.command({
      collStats: collectionName,
      indexDetails: true
    });

    const collection = this.db.collection(collectionName);

    // Get document count and size information
    const documentCount = await collection.estimatedDocumentCount();
    const newestDoc = await collection.findOne({}, { sort: { $natural: -1 } });
    const oldestDoc = await collection.findOne({}, { sort: { $natural: 1 } });

    return {
      collection: collectionName,
      capped: stats.capped,

      // Size information
      maxSize: stats.maxSize,
      size: stats.size,
      storageSize: stats.storageSize,
      sizeUtilization: stats.size / stats.maxSize,

      // Document information
      maxDocuments: stats.max,
      documentCount: documentCount,
      avgDocumentSize: documentCount > 0 ? stats.size / documentCount : 0,
      documentUtilization: stats.max ? documentCount / stats.max : null,

      // Time range information
      timespan: newestDoc && oldestDoc ? {
        oldest: oldestDoc.timestamp || oldestDoc._id.getTimestamp(),
        newest: newestDoc.timestamp || newestDoc._id.getTimestamp(),
        spanMs: newestDoc && oldestDoc ? 
          (newestDoc.timestamp || newestDoc._id.getTimestamp()).getTime() - 
          (oldestDoc.timestamp || oldestDoc._id.getTimestamp()).getTime() : 0
      } : null,

      // Performance information
      indexes: stats.indexSizes,
      totalIndexSize: Object.values(stats.indexSizes).reduce((sum, size) => sum + size, 0),

      // Collection metadata
      collectionInfo: this.collections.get(collectionName) || null,

      analyzedAt: new Date()
    };
  }

  async optimizeCappedCollection(collectionName, analysisOptions = {}) {
    // Analyze and provide optimization recommendations
    const stats = await this.getCappedCollectionStats(collectionName);
    const recommendations = [];

    // Size utilization analysis
    if (stats.sizeUtilization < 0.5) {
      recommendations.push({
        type: 'size_optimization',
        priority: 'medium',
        message: `Collection is only ${(stats.sizeUtilization * 100).toFixed(1)}% full. Consider reducing maxSize to save storage.`,
        suggestedMaxSize: Math.ceil(stats.size * 1.2) // 20% headroom
      });
    }

    if (stats.sizeUtilization > 0.9) {
      recommendations.push({
        type: 'size_warning',
        priority: 'high',
        message: `Collection is ${(stats.sizeUtilization * 100).toFixed(1)}% full. Consider increasing maxSize to prevent data loss.`,
        suggestedMaxSize: Math.ceil(stats.maxSize * 1.5) // 50% increase
      });
    }

    // Document count analysis
    if (stats.documentUtilization && stats.documentUtilization < 0.5) {
      recommendations.push({
        type: 'document_optimization',
        priority: 'low',
        message: `Only ${(stats.documentUtilization * 100).toFixed(1)}% of max documents used. Consider reducing max document limit.`,
        suggestedMaxDocs: Math.ceil(stats.documentCount * 1.2)
      });
    }

    // Document size analysis
    if (stats.avgDocumentSize > 10 * 1024) { // 10KB average
      recommendations.push({
        type: 'document_size_warning',
        priority: 'medium',
        message: `Average document size is ${(stats.avgDocumentSize / 1024).toFixed(1)}KB. Large documents may impact performance in capped collections.`
      });
    }

    // Index analysis
    if (stats.totalIndexSize > stats.size * 0.2) { // Indexes > 20% of data size
      recommendations.push({
        type: 'index_optimization',
        priority: 'medium',
        message: `Index size (${(stats.totalIndexSize / 1024 / 1024).toFixed(1)}MB) is large relative to data size. Consider if all indexes are necessary for capped collection workload.`
      });
    }

    // Time span analysis
    if (stats.timespan && stats.timespan.spanMs < 60 * 60 * 1000) { // Less than 1 hour
      recommendations.push({
        type: 'retention_warning',
        priority: 'high',
        message: `Data retention span is only ${(stats.timespan.spanMs / (60 * 1000)).toFixed(1)} minutes. Consider increasing collection size for longer data retention.`
      });
    }

    return {
      collectionStats: stats,
      recommendations: recommendations,
      optimizationScore: this.calculateOptimizationScore(stats, recommendations),
      analyzedAt: new Date()
    };
  }

  calculateOptimizationScore(stats, recommendations) {
    // Calculate optimization score (0-100, higher is better)
    let score = 100;

    // Deduct points for each recommendation based on priority
    recommendations.forEach(rec => {
      switch (rec.priority) {
        case 'high':
          score -= 30;
          break;
        case 'medium':
          score -= 15;
          break;
        case 'low':
          score -= 5;
          break;
      }
    });

    // Bonus points for good utilization
    if (stats.sizeUtilization >= 0.6 && stats.sizeUtilization <= 0.8) {
      score += 10; // Good size utilization
    }

    if (stats.avgDocumentSize < 5 * 1024) { // < 5KB average
      score += 5; // Good document size
    }

    return Math.max(0, Math.min(100, score));
  }

  async closeTailableCursor(cursorId) {
    // Safely close tailable cursor
    const cursorInfo = this.tailableCursors.get(cursorId);

    if (cursorInfo) {
      try {
        await cursorInfo.cursor.close();
      } catch (error) {
        console.error(`Error closing cursor ${cursorId}:`, error);
      }

      this.tailableCursors.delete(cursorId);
      console.log(`Closed tailable cursor: ${cursorId}`);
    }
  }

  async cleanup() {
    // Cleanup all tailable cursors
    const cursors = Array.from(this.tailableCursors.keys());

    for (const cursorId of cursors) {
      await this.closeTailableCursor(cursorId);
    }

    console.log(`Cleaned up ${cursors.length} tailable cursors`);
  }
}

Real-Time Streaming with Tailable Cursors

Implement real-time data processing with MongoDB's tailable cursors:

// Real-time streaming and event processing with tailable cursors
class RealTimeStreamProcessor {
  constructor(db) {
    this.db = db;
    this.cappedManager = new CappedCollectionManager(db);
    this.activeStreams = new Map();
    this.eventHandlers = new Map();
  }

  async setupLogStreaming(services = []) {
    // Setup real-time log streaming for multiple services
    for (const service of services) {
      await this.cappedManager.createLogCollection(service, {
        size: 100 * 1024 * 1024, // 100MB per service
        max: 50000
      });

      // Start streaming logs for this service
      this.startLogStream(service);
    }
  }

  async startLogStream(serviceName) {
    const collectionName = `${serviceName}_logs`;

    console.log(`Starting log stream for ${serviceName}...`);

    // Create stream processor
    const streamProcessor = async (logDocument, streamContext) => {
      try {
        // Process log based on level
        await this.processLogMessage(logDocument, streamContext);

        // Trigger alerts for critical logs
        if (logDocument.level === 'ERROR' || logDocument.level === 'FATAL') {
          await this.handleCriticalLog(logDocument);
        }

        // Update real-time metrics
        await this.updateLogMetrics(serviceName, logDocument);

        // Forward to external systems if needed
        if (this.eventHandlers.has('log_processed')) {
          await this.eventHandlers.get('log_processed')(logDocument);
        }

      } catch (processingError) {
        console.error('Log processing error:', processingError);
      }
    };

    // Start the stream
    const streamPromise = this.cappedManager.streamData(
      collectionName,
      streamProcessor,
      {
        autoReconnect: true,
        reconnectDelay: 5000,
        batchSize: 50
      }
    );

    this.activeStreams.set(serviceName, streamPromise);
  }

  async processLogMessage(logDocument, streamContext) {
    // Real-time log message processing
    const processing = {
      timestamp: new Date(),
      service: logDocument.serviceName,
      level: logDocument.level,
      messageLength: logDocument.message.length,
      hasMetadata: Object.keys(logDocument.metadata || {}).length > 0,
      processingLatency: Date.now() - logDocument.timestamp.getTime()
    };

    // Pattern matching for specific log types
    if (logDocument.message.includes('OutOfMemoryError')) {
      await this.handleOutOfMemoryAlert(logDocument);
    }

    if (logDocument.message.includes('Connection timeout')) {
      await this.handleConnectionIssue(logDocument);
    }

    if (logDocument.requestId && logDocument.level === 'ERROR') {
      await this.trackRequestError(logDocument);
    }

    // Store processing metadata for analytics
    await this.db.collection('log_processing_stats').insertOne({
      ...processing,
      logId: logDocument._id
    });
  }

  async handleCriticalLog(logDocument) {
    // Handle critical log events
    const alert = {
      timestamp: new Date(),
      alertType: 'critical_log',
      severity: logDocument.level,
      service: logDocument.serviceName,
      message: logDocument.message,
      metadata: logDocument.metadata,

      // Context information
      requestId: logDocument.requestId,
      userId: logDocument.userId,
      sessionId: logDocument.sessionId,

      // Alert details
      alertId: new ObjectId(),
      acknowledged: false,
      escalated: false
    };

    // Store alert
    await this.db.collection('critical_alerts').insertOne(alert);

    // Send notifications (implement based on your notification system)
    await this.sendAlertNotification(alert);

    // Auto-escalate if needed
    if (logDocument.level === 'FATAL') {
      setTimeout(async () => {
        await this.escalateAlert(alert.alertId);
      }, 5 * 60 * 1000); // Escalate after 5 minutes if not acknowledged
    }
  }

  async setupMetricsStreaming(metricTypes = []) {
    // Setup real-time metrics streaming
    for (const metricType of metricTypes) {
      await this.cappedManager.createMetricsCollection(metricType, {
        size: 200 * 1024 * 1024, // 200MB per metric type
        max: 100000
      });

      this.startMetricsStream(metricType);
    }
  }

  async startMetricsStream(metricType) {
    const collectionName = `metrics_${metricType}`;

    const metricsProcessor = async (metricsDocument, streamContext) => {
      try {
        // Real-time metrics processing
        await this.processMetricsData(metricsDocument, streamContext);

        // Check for threshold violations
        await this.checkMetricsThresholds(metricsDocument);

        // Update real-time dashboards
        if (this.eventHandlers.has('metrics_updated')) {
          await this.eventHandlers.get('metrics_updated')(metricsDocument);
        }

        // Aggregate into time-series buckets
        await this.aggregateMetricsData(metricsDocument);

      } catch (processingError) {
        console.error('Metrics processing error:', processingError);
      }
    };

    const streamPromise = this.cappedManager.streamData(
      collectionName,
      metricsProcessor,
      {
        autoReconnect: true,
        batchSize: 100,
        filter: { 
          // Only process metrics from last 5 minutes to avoid historical data on restart
          timestamp: { $gte: new Date(Date.now() - 5 * 60 * 1000) }
        }
      }
    );

    this.activeStreams.set(`metrics_${metricType}`, streamPromise);
  }

  async processMetricsData(metricsDocument, streamContext) {
    // Process individual metrics document
    const metricType = metricsDocument.metricType;
    const values = metricsDocument.values || {};
    const counters = metricsDocument.counters || {};
    const gauges = metricsDocument.gauges || {};

    // Calculate derived metrics
    const derivedMetrics = {
      timestamp: metricsDocument.timestamp,
      metricType: metricType,
      service: metricsDocument.dimensions?.service,

      // Calculate rates and percentages
      rates: {},
      percentages: {},
      health: {}
    };

    // Calculate request rate if applicable
    if (counters.requests) {
      const timeWindow = 60; // 1 minute window
      const requestRate = counters.requests / timeWindow;
      derivedMetrics.rates.requestsPerSecond = requestRate;
    }

    // Calculate error percentage
    if (counters.requests && counters.errors) {
      derivedMetrics.percentages.errorRate = (counters.errors / counters.requests) * 100;
    }

    // Calculate response time percentiles if histogram data available
    if (metricsDocument.histograms?.response_time) {
      derivedMetrics.responseTime = this.calculatePercentiles(
        metricsDocument.histograms.response_time
      );
    }

    // Health scoring
    derivedMetrics.health.score = this.calculateHealthScore(metricsDocument);
    derivedMetrics.health.status = this.getHealthStatus(derivedMetrics.health.score);

    // Store derived metrics
    await this.db.collection('derived_metrics').insertOne(derivedMetrics);
  }

  async checkMetricsThresholds(metricsDocument) {
    // Check metrics against defined thresholds
    const thresholds = await this.getThresholdsForService(
      metricsDocument.dimensions?.service
    );

    const violations = [];

    // Check various threshold types
    Object.entries(thresholds.counters || {}).forEach(([metric, threshold]) => {
      const value = metricsDocument.counters?.[metric];
      if (value !== undefined && value > threshold.max) {
        violations.push({
          type: 'counter',
          metric: metric,
          value: value,
          threshold: threshold.max,
          severity: threshold.severity || 'warning'
        });
      }
    });

    Object.entries(thresholds.gauges || {}).forEach(([metric, threshold]) => {
      const value = metricsDocument.gauges?.[metric];
      if (value !== undefined) {
        if (threshold.max && value > threshold.max) {
          violations.push({
            type: 'gauge_high',
            metric: metric,
            value: value,
            threshold: threshold.max,
            severity: threshold.severity || 'warning'
          });
        }
        if (threshold.min && value < threshold.min) {
          violations.push({
            type: 'gauge_low',
            metric: metric,
            value: value,
            threshold: threshold.min,
            severity: threshold.severity || 'warning'
          });
        }
      }
    });

    // Handle threshold violations
    for (const violation of violations) {
      await this.handleThresholdViolation(violation, metricsDocument);
    }
  }

  async setupEventStreaming(streamNames = []) {
    // Setup event streaming for event-driven architectures
    for (const streamName of streamNames) {
      await this.cappedManager.createEventStreamCollection(streamName, {
        size: 100 * 1024 * 1024,
        max: 50000
      });

      this.startEventStream(streamName);
    }
  }

  async startEventStream(streamName) {
    const collectionName = `events_${streamName}`;

    const eventProcessor = async (eventDocument, streamContext) => {
      try {
        // Process event based on type
        await this.processEvent(eventDocument, streamContext);

        // Trigger event handlers
        const eventType = eventDocument.eventType;
        if (this.eventHandlers.has(eventType)) {
          await this.eventHandlers.get(eventType)(eventDocument);
        }

        // Update event processing metrics
        await this.updateEventMetrics(streamName, eventDocument);

      } catch (processingError) {
        console.error('Event processing error:', processingError);

        // Handle event processing failure
        await this.handleEventProcessingFailure(eventDocument, processingError);
      }
    };

    const streamPromise = this.cappedManager.streamData(
      collectionName,
      eventProcessor,
      {
        autoReconnect: true,
        batchSize: 25 // Smaller batches for events to reduce latency
      }
    );

    this.activeStreams.set(`events_${streamName}`, streamPromise);
  }

  async processEvent(eventDocument, streamContext) {
    // Process individual event
    const eventType = eventDocument.eventType;
    const eventData = eventDocument.data;
    const eventMetadata = eventDocument.metadata;

    // Event processing based on type
    switch (eventType) {
      case 'user_action':
        await this.processUserActionEvent(eventDocument);
        break;

      case 'system_state_change':
        await this.processSystemStateEvent(eventDocument);
        break;

      case 'transaction_completed':
        await this.processTransactionEvent(eventDocument);
        break;

      case 'alert_triggered':
        await this.processAlertEvent(eventDocument);
        break;

      default:
        await this.processGenericEvent(eventDocument);
    }

    // Store event processing record
    await this.db.collection('event_processing_log').insertOne({
      eventId: eventDocument.eventId,
      eventType: eventType,
      streamName: eventDocument.streamName,
      processedAt: new Date(),
      processingLatency: Date.now() - eventDocument.timestamp.getTime(),
      success: true
    });
  }

  // Event handler registration
  registerEventHandler(eventType, handler) {
    this.eventHandlers.set(eventType, handler);
  }

  unregisterEventHandler(eventType) {
    this.eventHandlers.delete(eventType);
  }

  // Utility methods
  async getThresholdsForService(serviceName) {
    // Get threshold configuration for service
    const config = await this.db.collection('service_thresholds').findOne({
      service: serviceName
    });

    return config?.thresholds || {
      counters: {},
      gauges: {},
      histograms: {}
    };
  }

  calculatePercentiles(histogramData) {
    // Calculate percentiles from histogram data
    // Implementation depends on histogram format
    return {
      p50: 0,
      p90: 0,
      p95: 0,
      p99: 0
    };
  }

  calculateHealthScore(metricsDocument) {
    // Calculate overall health score from metrics
    let score = 100;

    // Deduct based on error rates, response times, etc.
    const errorRate = metricsDocument.counters?.errors / metricsDocument.counters?.requests;
    if (errorRate > 0.05) score -= 30; // > 5% error rate
    if (errorRate > 0.01) score -= 15; // > 1% error rate

    return Math.max(0, score);
  }

  getHealthStatus(score) {
    if (score >= 90) return 'healthy';
    if (score >= 70) return 'warning';
    if (score >= 50) return 'critical';
    return 'unhealthy';
  }

  async handleThresholdViolation(violation, metricsDocument) {
    // Handle metrics threshold violations
    console.log(`Threshold violation: ${violation.metric} = ${violation.value} (threshold: ${violation.threshold})`);

    // Store violation record
    await this.db.collection('threshold_violations').insertOne({
      ...violation,
      timestamp: new Date(),
      service: metricsDocument.dimensions?.service,
      environment: metricsDocument.dimensions?.environment,
      metricsDocument: metricsDocument._id
    });
  }

  async handleEventProcessingFailure(eventDocument, error) {
    // Handle event processing failures
    await this.db.collection('event_processing_errors').insertOne({
      eventId: eventDocument.eventId,
      eventType: eventDocument.eventType,
      streamName: eventDocument.streamName,
      error: error.message,
      errorStack: error.stack,
      failedAt: new Date(),
      retryCount: 0
    });
  }

  // Cleanup and shutdown
  async stopAllStreams() {
    const streamPromises = Array.from(this.activeStreams.values());

    // Stop all active streams
    for (const [streamName, streamPromise] of this.activeStreams.entries()) {
      console.log(`Stopping stream: ${streamName}`);
      // Streams will stop when cursors are closed
    }

    await this.cappedManager.cleanup();
    this.activeStreams.clear();

    console.log(`Stopped ${streamPromises.length} streams`);
  }

  // Placeholder methods for event processing
  async processUserActionEvent(eventDocument) { /* Implementation */ }
  async processSystemStateEvent(eventDocument) { /* Implementation */ }
  async processTransactionEvent(eventDocument) { /* Implementation */ }
  async processAlertEvent(eventDocument) { /* Implementation */ }
  async processGenericEvent(eventDocument) { /* Implementation */ }
  async updateLogMetrics(serviceName, logDocument) { /* Implementation */ }
  async updateEventMetrics(streamName, eventDocument) { /* Implementation */ }
  async sendAlertNotification(alert) { /* Implementation */ }
  async escalateAlert(alertId) { /* Implementation */ }
  async handleOutOfMemoryAlert(logDocument) { /* Implementation */ }
  async handleConnectionIssue(logDocument) { /* Implementation */ }
  async trackRequestError(logDocument) { /* Implementation */ }
  async aggregateMetricsData(metricsDocument) { /* Implementation */ }
}

Performance Monitoring and Chat Systems

Implement specialized capped collection patterns for different use cases:

// Specialized capped collection implementations
class SpecializedCappedSystems {
  constructor(db) {
    this.db = db;
    this.cappedManager = new CappedCollectionManager(db);
  }

  async setupChatSystem(channelId, options = {}) {
    // High-performance chat message storage
    const collectionName = `chat_${channelId}`;
    const chatOptions = {
      capped: true,
      size: 50 * 1024 * 1024,  // 50MB per channel
      max: 25000,              // 25K messages per channel
      autoIndexId: false
    };

    const collection = await this.db.createCollection(collectionName, {
      ...chatOptions,
      ...options
    });

    // Setup for real-time message streaming
    await this.setupChatStreaming(channelId);

    return collection;
  }

  async sendChatMessage(channelId, messageData) {
    const collectionName = `chat_${channelId}`;

    const messageDocument = {
      timestamp: new Date(),
      messageId: new ObjectId(),
      channelId: channelId,

      // Message content
      content: messageData.content,
      messageType: messageData.type || 'text', // text, image, file, system

      // Sender information
      sender: {
        userId: messageData.senderId,
        username: messageData.senderUsername,
        avatar: messageData.senderAvatar || null
      },

      // Message metadata
      metadata: {
        edited: false,
        editHistory: [],
        reactions: {},
        mentions: messageData.mentions || [],
        attachments: messageData.attachments || [],
        threadParent: messageData.threadParent || null
      },

      // Moderation
      flagged: false,
      deleted: false,
      moderatorActions: []
    };

    const collection = this.db.collection(collectionName);
    await collection.insertOne(messageDocument);

    return messageDocument;
  }

  async getChatHistory(channelId, options = {}) {
    const collectionName = `chat_${channelId}`;
    const collection = this.db.collection(collectionName);

    const query = { deleted: { $ne: true } };

    if (options.since) {
      query.timestamp = { $gte: options.since };
    }

    if (options.before) {
      query.timestamp = { ...query.timestamp, $lt: options.before };
    }

    // Use natural order for chat (insertion order)
    const messages = await collection
      .find(query)
      .sort({ $natural: options.reverse ? -1 : 1 })
      .limit(options.limit || 50)
      .toArray();

    return {
      channelId: channelId,
      messages: messages,
      count: messages.length,
      hasMore: messages.length === (options.limit || 50)
    };
  }

  async setupChatStreaming(channelId) {
    const collectionName = `chat_${channelId}`;

    const messageProcessor = async (messageDocument, streamContext) => {
      // Process new chat messages in real-time
      try {
        // Broadcast to connected users (implement based on your WebSocket system)
        await this.broadcastChatMessage(messageDocument);

        // Update user activity
        await this.updateUserActivity(messageDocument.sender.userId, channelId);

        // Check for mentions and notifications
        if (messageDocument.metadata.mentions.length > 0) {
          await this.handleMentionNotifications(messageDocument);
        }

        // Content moderation
        await this.moderateMessage(messageDocument);

      } catch (error) {
        console.error('Chat message processing error:', error);
      }
    };

    // Start real-time streaming for this channel
    return this.cappedManager.streamData(collectionName, messageProcessor, {
      autoReconnect: true,
      batchSize: 10, // Small batches for low latency
      filter: { deleted: { $ne: true } } // Don't stream deleted messages
    });
  }

  async setupPerformanceMonitoring(applicationName, options = {}) {
    // Application performance monitoring with capped collections
    const collectionName = `perf_${applicationName}`;
    const perfOptions = {
      capped: true,
      size: 500 * 1024 * 1024, // 500MB for performance data
      max: 200000,             // 200K performance records
      autoIndexId: false
    };

    const collection = await this.db.createCollection(collectionName, {
      ...perfOptions,
      ...options
    });

    // Setup real-time performance monitoring
    await this.setupPerformanceStreaming(applicationName);

    return collection;
  }

  async recordPerformanceMetrics(applicationName, performanceData) {
    const collectionName = `perf_${applicationName}`;

    const performanceDocument = {
      timestamp: new Date(),
      application: applicationName,

      // Request/response metrics
      request: {
        method: performanceData.method,
        path: performanceData.path,
        userAgent: performanceData.userAgent,
        ip: performanceData.ip,
        size: performanceData.requestSize || 0
      },

      response: {
        statusCode: performanceData.statusCode,
        size: performanceData.responseSize || 0,
        contentType: performanceData.contentType
      },

      // Timing metrics
      timings: {
        total: performanceData.responseTime,
        dns: performanceData.dnsTime || 0,
        connect: performanceData.connectTime || 0,
        ssl: performanceData.sslTime || 0,
        send: performanceData.sendTime || 0,
        wait: performanceData.waitTime || 0,
        receive: performanceData.receiveTime || 0
      },

      // Performance indicators
      performance: {
        cpuUsage: performanceData.cpuUsage,
        memoryUsage: performanceData.memoryUsage,
        diskIO: performanceData.diskIO || {},
        networkIO: performanceData.networkIO || {}
      },

      // Error tracking
      errors: performanceData.errors || [],
      warnings: performanceData.warnings || [],

      // User session info
      session: {
        userId: performanceData.userId,
        sessionId: performanceData.sessionId,
        isFirstVisit: performanceData.isFirstVisit || false
      },

      // Geographic and device info
      context: {
        country: performanceData.country,
        city: performanceData.city,
        device: performanceData.device,
        os: performanceData.os,
        browser: performanceData.browser
      }
    };

    const collection = this.db.collection(collectionName);
    await collection.insertOne(performanceDocument);

    return performanceDocument;
  }

  async setupPerformanceStreaming(applicationName) {
    const collectionName = `perf_${applicationName}`;

    const performanceProcessor = async (perfDocument, streamContext) => {
      try {
        // Real-time performance analysis
        await this.analyzePerformanceData(perfDocument);

        // Detect performance anomalies
        await this.detectPerformanceAnomalies(perfDocument);

        // Update real-time dashboards
        await this.updatePerformanceDashboard(perfDocument);

      } catch (error) {
        console.error('Performance data processing error:', error);
      }
    };

    return this.cappedManager.streamData(collectionName, performanceProcessor, {
      autoReconnect: true,
      batchSize: 50
    });
  }

  async setupAuditLogging(systemName, options = {}) {
    // High-integrity audit logging with capped collections
    const collectionName = `audit_${systemName}`;
    const auditOptions = {
      capped: true,
      size: 1024 * 1024 * 1024, // 1GB for audit logs
      max: 500000,              // 500K audit records
      autoIndexId: false
    };

    const collection = await this.db.createCollection(collectionName, {
      ...auditOptions,
      ...options
    });

    return collection;
  }

  async recordAuditEvent(systemName, auditData) {
    const collectionName = `audit_${systemName}`;

    const auditDocument = {
      timestamp: new Date(),
      system: systemName,
      auditId: new ObjectId(),

      // Event information
      event: {
        type: auditData.eventType,
        action: auditData.action,
        resource: auditData.resource,
        resourceId: auditData.resourceId,
        outcome: auditData.outcome || 'success' // success, failure, pending
      },

      // Actor information
      actor: {
        userId: auditData.userId,
        username: auditData.username,
        role: auditData.userRole,
        ip: auditData.ip,
        userAgent: auditData.userAgent
      },

      // Context
      context: {
        sessionId: auditData.sessionId,
        requestId: auditData.requestId,
        correlationId: auditData.correlationId,
        source: auditData.source || 'application'
      },

      // Data changes (for modification events)
      changes: {
        before: auditData.beforeData || null,
        after: auditData.afterData || null,
        fields: auditData.changedFields || []
      },

      // Security context
      security: {
        riskScore: auditData.riskScore || 0,
        securityFlags: auditData.securityFlags || [],
        authenticationMethod: auditData.authMethod
      },

      // Compliance tags
      compliance: {
        regulations: auditData.regulations || [], // GDPR, SOX, HIPAA, etc.
        dataClassification: auditData.dataClassification || 'internal',
        retentionPolicy: auditData.retentionPolicy
      }
    };

    const collection = this.db.collection(collectionName);

    // Use acknowledged write for audit logs
    await collection.insertOne(auditDocument, {
      writeConcern: { w: 1, j: true }
    });

    return auditDocument;
  }

  async queryAuditLogs(systemName, criteria = {}) {
    const collectionName = `audit_${systemName}`;
    const collection = this.db.collection(collectionName);

    const query = {};

    if (criteria.userId) {
      query['actor.userId'] = criteria.userId;
    }

    if (criteria.eventType) {
      query['event.type'] = criteria.eventType;
    }

    if (criteria.resource) {
      query['event.resource'] = criteria.resource;
    }

    if (criteria.timeRange) {
      query.timestamp = {
        $gte: criteria.timeRange.start,
        $lte: criteria.timeRange.end
      };
    }

    if (criteria.outcome) {
      query['event.outcome'] = criteria.outcome;
    }

    const auditEvents = await collection
      .find(query)
      .sort({ $natural: -1 }) // Most recent first
      .limit(criteria.limit || 100)
      .toArray();

    return {
      system: systemName,
      criteria: criteria,
      events: auditEvents,
      count: auditEvents.length
    };
  }

  // Utility methods for chat system
  async broadcastChatMessage(messageDocument) {
    // Implement WebSocket broadcasting
    console.log(`Broadcasting message ${messageDocument.messageId} to channel ${messageDocument.channelId}`);
  }

  async updateUserActivity(userId, channelId) {
    // Update user activity in regular collection
    await this.db.collection('user_activity').updateOne(
      { userId: userId },
      {
        $set: { lastSeen: new Date() },
        $addToSet: { activeChannels: channelId }
      },
      { upsert: true }
    );
  }

  async handleMentionNotifications(messageDocument) {
    // Handle user mentions
    for (const mentionedUser of messageDocument.metadata.mentions) {
      await this.db.collection('notifications').insertOne({
        userId: mentionedUser.userId,
        type: 'mention',
        channelId: messageDocument.channelId,
        messageId: messageDocument.messageId,
        fromUser: messageDocument.sender.userId,
        timestamp: new Date(),
        read: false
      });
    }
  }

  async moderateMessage(messageDocument) {
    // Basic content moderation
    const content = messageDocument.content.toLowerCase();
    const hasInappropriateContent = false; // Implement your moderation logic

    if (hasInappropriateContent) {
      await this.db.collection(`chat_${messageDocument.channelId}`).updateOne(
        { _id: messageDocument._id },
        { 
          $set: { 
            flagged: true,
            'moderatorActions': [{
              action: 'flagged',
              reason: 'inappropriate_content',
              timestamp: new Date(),
              automated: true
            }]
          }
        }
      );
    }
  }

  // Performance monitoring methods
  async analyzePerformanceData(perfDocument) {
    // Analyze performance data for patterns
    const responseTime = perfDocument.timings.total;
    const statusCode = perfDocument.response.statusCode;

    // Calculate performance score
    let score = 100;
    if (responseTime > 5000) score -= 50; // > 5 seconds
    else if (responseTime > 2000) score -= 30; // > 2 seconds
    else if (responseTime > 1000) score -= 15; // > 1 second

    if (statusCode >= 500) score -= 40;
    else if (statusCode >= 400) score -= 20;

    // Store analysis
    await this.db.collection('performance_analysis').insertOne({
      performanceId: perfDocument._id,
      application: perfDocument.application,
      timestamp: perfDocument.timestamp,
      score: Math.max(0, score),
      responseTime: responseTime,
      statusCode: statusCode,
      performanceCategory: this.categorizePerformance(responseTime),
      analyzedAt: new Date()
    });
  }

  categorizePerformance(responseTime) {
    if (responseTime < 200) return 'excellent';
    if (responseTime < 1000) return 'good';
    if (responseTime < 3000) return 'acceptable';
    if (responseTime < 10000) return 'poor';
    return 'unacceptable';
  }

  async detectPerformanceAnomalies(perfDocument) {
    // Simple anomaly detection
    const responseTime = perfDocument.timings.total;
    const path = perfDocument.request.path;

    // Get historical average for this endpoint
    const recentPerformance = await this.db.collection(`perf_${perfDocument.application}`)
      .find({
        'request.path': path,
        timestamp: { $gte: new Date(Date.now() - 60 * 60 * 1000) } // Last hour
      })
      .sort({ $natural: -1 })
      .limit(100)
      .toArray();

    if (recentPerformance.length > 10) {
      const avgResponseTime = recentPerformance.reduce((sum, perf) => 
        sum + perf.timings.total, 0) / recentPerformance.length;

      // Alert if current response time is 3x average
      if (responseTime > avgResponseTime * 3) {
        await this.db.collection('performance_alerts').insertOne({
          application: perfDocument.application,
          path: path,
          currentResponseTime: responseTime,
          averageResponseTime: avgResponseTime,
          severity: responseTime > avgResponseTime * 5 ? 'critical' : 'warning',
          timestamp: new Date()
        });
      }
    }
  }

  async updatePerformanceDashboard(perfDocument) {
    // Update real-time performance dashboard
    console.log(`Performance update: ${perfDocument.application} ${perfDocument.request.path} - ${perfDocument.timings.total}ms`);
  }
}

SQL-Style Capped Collection Management with QueryLeaf

QueryLeaf provides familiar SQL approaches to MongoDB capped collection operations:

-- QueryLeaf capped collection operations with SQL-familiar syntax

-- Create capped collection equivalent to CREATE TABLE with size limits
CREATE CAPPED COLLECTION application_logs
WITH (
  MAX_SIZE = 104857600,    -- 100MB maximum size  
  MAX_DOCUMENTS = 50000,   -- Maximum document count
  AUTO_INDEX_ID = false    -- Optimize for insert performance
);

-- High-performance insertions equivalent to INSERT statements
INSERT INTO application_logs (
  timestamp,
  level,
  service_name,
  message,
  metadata,
  request_id,
  user_id
) VALUES (
  CURRENT_TIMESTAMP,
  'INFO',
  'auth-service',
  'User login successful',
  JSON_BUILD_OBJECT('ip', '192.168.1.100', 'browser', 'Chrome'),
  'req_001',
  12345
);

-- Query recent logs with natural order (insertion order preserved)
SELECT 
  timestamp,
  level,
  service_name,
  message,
  metadata->>'ip' as client_ip,
  request_id
FROM application_logs
WHERE level IN ('ERROR', 'FATAL', 'WARN')
  AND timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
ORDER BY $NATURAL DESC  -- MongoDB natural order (insertion order)
LIMIT 100;

-- Capped collection statistics and monitoring
SELECT 
  COLLECTION_NAME() as collection,
  IS_CAPPED() as is_capped,
  MAX_SIZE() as max_size_bytes,
  CURRENT_SIZE() as current_size_bytes,
  ROUND((CURRENT_SIZE()::float / MAX_SIZE()) * 100, 2) as size_utilization_pct,

  MAX_DOCUMENTS() as max_documents,
  DOCUMENT_COUNT() as current_documents,
  ROUND((DOCUMENT_COUNT()::float / MAX_DOCUMENTS()) * 100, 2) as doc_utilization_pct,

  -- Time span information
  MIN(timestamp) as oldest_record,
  MAX(timestamp) as newest_record,
  EXTRACT(EPOCH FROM (MAX(timestamp) - MIN(timestamp))) / 3600 as timespan_hours

FROM application_logs;

-- Real-time streaming with SQL-style tailable cursor
DECLARE @stream_cursor TAILABLE CURSOR FOR
SELECT 
  timestamp,
  level,
  service_name,
  message,
  request_id,
  user_id
FROM application_logs
WHERE level IN ('ERROR', 'FATAL')
ORDER BY $NATURAL ASC;

-- Process streaming data (pseudo-code for real-time processing)
WHILE CURSOR_HAS_NEXT(@stream_cursor)
BEGIN
  FETCH NEXT FROM @stream_cursor INTO @log_record;

  -- Process log record
  IF @log_record.level = 'FATAL'
    EXEC send_alert_notification @log_record;

  -- Update real-time metrics
  UPDATE real_time_metrics 
  SET error_count = error_count + 1,
      last_error_time = @log_record.timestamp
  WHERE service_name = @log_record.service_name;
END;

-- Performance monitoring with capped collections
CREATE CAPPED COLLECTION performance_metrics
WITH (
  MAX_SIZE = 524288000,    -- 500MB
  MAX_DOCUMENTS = 200000,
  AUTO_INDEX_ID = false
);

-- Record performance data
INSERT INTO performance_metrics (
  timestamp,
  application,
  request_method,
  request_path,
  response_time_ms,
  status_code,
  cpu_usage_pct,
  memory_usage_mb,
  user_id
) VALUES (
  CURRENT_TIMESTAMP,
  'web-app',
  'GET',
  '/api/v1/users',
  234,
  200,
  45.2,
  512,
  12345
);

-- Real-time performance analysis
WITH performance_window AS (
  SELECT 
    application,
    request_path,
    response_time_ms,
    status_code,
    timestamp,
    -- Calculate performance percentiles over sliding window
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY response_time_ms) 
      OVER (PARTITION BY request_path ORDER BY timestamp 
            ROWS BETWEEN 99 PRECEDING AND CURRENT ROW) as p50_response_time,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) 
      OVER (PARTITION BY request_path ORDER BY timestamp 
            ROWS BETWEEN 99 PRECEDING AND CURRENT ROW) as p95_response_time,
    -- Error rate calculation
    SUM(CASE WHEN status_code >= 400 THEN 1 ELSE 0 END) 
      OVER (PARTITION BY request_path ORDER BY timestamp 
            ROWS BETWEEN 99 PRECEDING AND CURRENT ROW) as error_count,
    COUNT(*) 
      OVER (PARTITION BY request_path ORDER BY timestamp 
            ROWS BETWEEN 99 PRECEDING AND CURRENT ROW) as total_requests
  FROM performance_metrics
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '5 minutes'
)
SELECT 
  application,
  request_path,
  ROUND(AVG(response_time_ms), 0) as avg_response_time,
  ROUND(MAX(p50_response_time), 0) as median_response_time,
  ROUND(MAX(p95_response_time), 0) as p95_response_time,
  ROUND((MAX(error_count)::float / MAX(total_requests)) * 100, 2) as error_rate_pct,
  COUNT(*) as sample_size,

  -- Performance health assessment
  CASE 
    WHEN MAX(p95_response_time) > 5000 THEN 'CRITICAL'
    WHEN MAX(p95_response_time) > 2000 THEN 'WARNING'
    WHEN (MAX(error_count)::float / MAX(total_requests)) > 0.05 THEN 'WARNING'
    ELSE 'HEALTHY'
  END as health_status

FROM performance_window
WHERE total_requests >= 10  -- Minimum sample size
GROUP BY application, request_path
ORDER BY p95_response_time DESC;

-- Chat system with capped collections
CREATE CAPPED COLLECTION chat_general
WITH (
  MAX_SIZE = 52428800,  -- 50MB
  MAX_DOCUMENTS = 25000,
  AUTO_INDEX_ID = false
);

-- Send chat messages
INSERT INTO chat_general (
  timestamp,
  message_id,
  channel_id,
  content,
  message_type,
  sender_user_id,
  sender_username,
  mentions,
  attachments
) VALUES (
  CURRENT_TIMESTAMP,
  UUID_GENERATE_V4(),
  'general',
  'Hello everyone! How is everyone doing today?',
  'text',
  12345,
  'johndoe',
  ARRAY[]::TEXT[],
  ARRAY[]::JSONB[]
);

-- Get recent chat history with natural ordering
SELECT 
  timestamp,
  content,
  sender_username,
  message_type,
  CASE WHEN ARRAY_LENGTH(mentions, 1) > 0 THEN 
    'Mentions: ' || ARRAY_TO_STRING(mentions, ', ') 
    ELSE '' 
  END as mention_info
FROM chat_general
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
  AND deleted = false
ORDER BY $NATURAL ASC  -- Maintain insertion order for chat
LIMIT 50;

-- Real-time chat streaming
DECLARE @chat_stream TAILABLE CURSOR FOR
SELECT 
  timestamp,
  message_id,
  content,
  sender_user_id,
  sender_username,
  mentions
FROM chat_general
WHERE timestamp >= CURRENT_TIMESTAMP  -- Only new messages
  AND deleted = false
ORDER BY $NATURAL ASC;

-- Audit logging with capped collections
CREATE CAPPED COLLECTION audit_system
WITH (
  MAX_SIZE = 1073741824,  -- 1GB for audit logs
  MAX_DOCUMENTS = 500000,
  AUTO_INDEX_ID = false
);

-- Record audit events with high integrity
INSERT INTO audit_system (
  timestamp,
  audit_id,
  event_type,
  action,
  resource,
  resource_id,
  user_id,
  username,
  ip_address,
  outcome,
  risk_score,
  compliance_regulations
) VALUES (
  CURRENT_TIMESTAMP,
  UUID_GENERATE_V4(),
  'data_access',
  'SELECT',
  'customer_records',
  'cust_12345',
  67890,
  'jane.analyst',
  '192.168.1.200',
  'success',
  15,
  ARRAY['GDPR', 'SOX']
);

-- Audit log analysis and compliance reporting
WITH audit_summary AS (
  SELECT 
    event_type,
    action,
    resource,
    outcome,
    DATE_TRUNC('hour', timestamp) as hour_bucket,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users,
    AVG(risk_score) as avg_risk_score,
    SUM(CASE WHEN outcome = 'failure' THEN 1 ELSE 0 END) as failure_count
  FROM audit_system
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY event_type, action, resource, outcome, DATE_TRUNC('hour', timestamp)
)
SELECT 
  hour_bucket,
  event_type,
  action,
  resource,
  SUM(event_count) as total_events,
  SUM(unique_users) as total_unique_users,
  ROUND(AVG(avg_risk_score), 1) as avg_risk_score,
  SUM(failure_count) as total_failures,
  ROUND((SUM(failure_count)::float / SUM(event_count)) * 100, 2) as failure_rate_pct,

  -- Compliance flags
  CASE 
    WHEN SUM(failure_count) > SUM(event_count) * 0.1 THEN 'HIGH_FAILURE_RATE'
    WHEN AVG(avg_risk_score) > 80 THEN 'HIGH_RISK_ACTIVITY' 
    ELSE 'NORMAL'
  END as compliance_flag

FROM audit_summary
GROUP BY hour_bucket, event_type, action, resource
HAVING SUM(event_count) >= 5  -- Minimum activity threshold
ORDER BY hour_bucket DESC, total_events DESC;

-- Capped collection maintenance and optimization
SELECT 
  collection_name,
  max_size_mb,
  current_size_mb,
  size_efficiency_pct,
  max_documents,
  current_documents,
  doc_efficiency_pct,
  oldest_record,
  newest_record,
  retention_hours,

  -- Optimization recommendations
  CASE 
    WHEN size_efficiency_pct < 50 THEN 'REDUCE_MAX_SIZE'
    WHEN size_efficiency_pct > 90 THEN 'INCREASE_MAX_SIZE'
    WHEN doc_efficiency_pct < 50 THEN 'REDUCE_MAX_DOCS'  
    WHEN retention_hours < 1 THEN 'INCREASE_COLLECTION_SIZE'
    ELSE 'OPTIMAL'
  END as optimization_recommendation

FROM (
  SELECT 
    'application_logs' as collection_name,
    MAX_SIZE() / 1024 / 1024 as max_size_mb,
    CURRENT_SIZE() / 1024 / 1024 as current_size_mb,
    ROUND((CURRENT_SIZE()::float / MAX_SIZE()) * 100, 1) as size_efficiency_pct,
    MAX_DOCUMENTS() as max_documents,
    DOCUMENT_COUNT() as current_documents,
    ROUND((DOCUMENT_COUNT()::float / MAX_DOCUMENTS()) * 100, 1) as doc_efficiency_pct,
    MIN(timestamp) as oldest_record,
    MAX(timestamp) as newest_record,
    ROUND(EXTRACT(EPOCH FROM (MAX(timestamp) - MIN(timestamp))) / 3600, 1) as retention_hours
  FROM application_logs

  UNION ALL

  SELECT 
    'performance_metrics' as collection_name,
    MAX_SIZE() / 1024 / 1024 as max_size_mb,
    CURRENT_SIZE() / 1024 / 1024 as current_size_mb,
    ROUND((CURRENT_SIZE()::float / MAX_SIZE()) * 100, 1) as size_efficiency_pct,
    MAX_DOCUMENTS() as max_documents,  
    DOCUMENT_COUNT() as current_documents,
    ROUND((DOCUMENT_COUNT()::float / MAX_DOCUMENTS()) * 100, 1) as doc_efficiency_pct,
    MIN(timestamp) as oldest_record,
    MAX(timestamp) as newest_record,
    ROUND(EXTRACT(EPOCH FROM (MAX(timestamp) - MIN(timestamp))) / 3600, 1) as retention_hours
  FROM performance_metrics
) capped_stats
ORDER BY size_efficiency_pct DESC;

-- QueryLeaf provides comprehensive capped collection features:
-- 1. SQL-familiar CREATE CAPPED COLLECTION syntax
-- 2. Automatic circular buffer behavior with size and document limits
-- 3. Natural ordering support ($NATURAL) for insertion-order queries
-- 4. Tailable cursor support for real-time streaming
-- 5. Built-in collection statistics and monitoring functions
-- 6. Performance optimization recommendations
-- 7. Integration with standard SQL analytics and aggregation functions
-- 8. Compliance and audit logging patterns
-- 9. Real-time alerting and anomaly detection
-- 10. Seamless integration with MongoDB's capped collection performance benefits

Best Practices for Capped Collections

Design Guidelines

Essential practices for effective capped collection usage:

Size Planning: Calculate appropriate collection sizes based on data velocity and retention requirements
Document Size: Keep documents reasonably sized to maximize the number of records within size limits
No Updates: Design for append-only workloads since capped collections don't support updates that increase document size
Natural Ordering: Leverage natural insertion ordering for optimal query performance
Index Strategy: Use minimal indexing to maintain high insert performance
Monitoring: Implement monitoring to track utilization and performance characteristics

Use Case Selection

Choose capped collections for appropriate scenarios:

High-Volume Logs: Application logs, access logs, error logs with automatic rotation
Real-Time Analytics: Metrics, performance data, sensor readings with fixed retention
Event Streaming: Message queues, event sourcing, activity streams
Chat and Messaging: Real-time messaging systems with automatic message history management
Audit Trails: Compliance logging with predictable storage requirements
Cache-Like Data: Temporary data storage with automatic eviction policies

Conclusion

MongoDB capped collections provide specialized solutions for high-volume, streaming data scenarios where traditional database approaches fall short. By implementing fixed-size circular buffers at the database level, capped collections deliver predictable performance, automatic data lifecycle management, and built-in support for real-time streaming applications.

Key capped collection benefits include:

Predictable Performance: Fixed size ensures consistent insert and query performance regardless of data volume
Automatic Management: No manual cleanup or data retention policies required
High Throughput: Optimized for append-only workloads with minimal index overhead
Natural Ordering: Guaranteed insertion order preservation for time-series data
Real-Time Streaming: Built-in tailable cursor support for live data processing

Whether you're building logging systems, real-time analytics platforms, chat applications, or event streaming architectures, MongoDB capped collections with QueryLeaf's familiar SQL interface provide the foundation for high-performance data management. This combination enables you to implement sophisticated streaming data solutions while preserving familiar development patterns and query approaches.

QueryLeaf Integration: QueryLeaf automatically manages MongoDB capped collection creation, sizing, and optimization while providing SQL-familiar CREATE CAPPED COLLECTION syntax and natural ordering support. Complex streaming patterns, real-time analytics, and circular buffer management are seamlessly handled through familiar SQL patterns, making high-performance streaming data both powerful and accessible.

The integration of automatic circular buffer management with SQL-style query patterns makes MongoDB an ideal platform for applications requiring both high-volume data ingestion and familiar database interaction patterns, ensuring your streaming data solutions remain both performant and maintainable as they scale and evolve.