MongoDB Transaction Error Handling and Recovery Patterns: Building Resilient Applications with Advanced Error Management and Automatic Retry Strategies

Production MongoDB applications need error handling and recovery mechanisms that can gracefully manage transaction failures, network interruptions, server unavailability, and resource constraints while preserving data consistency and application reliability. Traditional database error handling rarely accounts for the realities of distributed systems, which leads to incomplete transactions, data inconsistencies, and poor user experiences under complex failure scenarios.

MongoDB provides comprehensive transaction error handling through retryable error labels, detailed error classification, and recovery patterns that enable applications to maintain consistency and reliability through network partitions, replica set failovers, and resource contention. Unlike traditional databases that expose only basic error codes and limited retry logic, MongoDB attaches error labels such as TransientTransactionError and UnknownTransactionCommitResult to failures, giving drivers and applications the diagnostic information needed to distinguish safely retryable errors from those that require intervention.
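
Much of this behavior is already available at the driver level: a transaction that fails with the TransientTransactionError label can be re-run from the start, and a commit that fails with UnknownTransactionCommitResult can be retried on its own. The following is a minimal sketch of that pattern with the Node.js driver; the connection URI and the transferFunds callback are placeholders for your own deployment and business logic.

// Minimal sketch: label-aware transaction retry with the Node.js driver.
// The uri argument and the transferFunds callback are hypothetical placeholders.
const { MongoClient } = require('mongodb');

async function runWithTransientRetry(uri, transferFunds) {
  const client = new MongoClient(uri);
  await client.connect();
  const session = client.startSession();

  try {
    // withTransaction retries the callback on TransientTransactionError and
    // retries the commit on UnknownTransactionCommitResult automatically.
    await session.withTransaction(async () => {
      await transferFunds(client, session);
    });
  } catch (error) {
    // Errors that still escape are either non-retryable or exhausted the
    // driver's retry window; inspect the label before deciding what to do.
    if (error.hasErrorLabel && error.hasErrorLabel('TransientTransactionError')) {
      console.warn('Transient error persisted after driver retries:', error.message);
    }
    throw error;
  } finally {
    await session.endSession();
    await client.close();
  }
}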

The Traditional Transaction Error Handling Challenge

Conventional approaches to database transaction error management in enterprise applications face significant limitations in resilience and recovery capabilities:

-- Traditional PostgreSQL transaction error handling - basic error management with limited recovery options

-- Simple transaction error tracking table
CREATE TABLE transaction_error_log (
    error_id SERIAL PRIMARY KEY,
    transaction_id UUID,
    connection_id VARCHAR(100),

    -- Basic error information
    error_code VARCHAR(20),
    error_message TEXT,
    error_category VARCHAR(50), -- connection, constraint, timeout, etc.

    -- Timing information
    transaction_start_time TIMESTAMP,
    error_occurred_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Context information (limited)
    table_name VARCHAR(100),
    operation_type VARCHAR(20), -- INSERT, UPDATE, DELETE, SELECT
    affected_rows INTEGER,

    -- Simple retry tracking
    retry_count INTEGER DEFAULT 0,
    max_retries INTEGER DEFAULT 3,
    retry_successful BOOLEAN DEFAULT FALSE,

    -- Manual resolution tracking
    resolved_at TIMESTAMP,
    resolution_method VARCHAR(100),
    resolved_by VARCHAR(100)
);

-- Basic transaction state tracking
CREATE TABLE active_transactions (
    transaction_id UUID PRIMARY KEY,
    connection_id VARCHAR(100) NOT NULL,
    start_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Simple state management
    transaction_status VARCHAR(20) DEFAULT 'active', -- active, committed, rolled_back, failed
    isolation_level VARCHAR(30),
    read_only BOOLEAN DEFAULT FALSE,

    -- Basic operation tracking
    operations_count INTEGER DEFAULT 0,
    tables_affected TEXT[], -- Simple array of table names

    -- Timeout management (basic)
    timeout_seconds INTEGER DEFAULT 300,
    last_activity TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Error tracking
    error_count INTEGER DEFAULT 0,
    last_error_message TEXT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Manual transaction recovery procedure (limited functionality)
CREATE OR REPLACE FUNCTION recover_failed_transaction(
    p_transaction_id UUID,
    p_recovery_strategy VARCHAR(50) DEFAULT 'rollback'
) RETURNS TABLE (
    recovery_status VARCHAR(20),
    recovery_message TEXT,
    operations_recovered INTEGER
) AS $$
DECLARE
    v_transaction_record RECORD;
    v_recovery_count INTEGER := 0;
    v_retry_count INTEGER;
    v_max_retries INTEGER;
BEGIN
    -- Get transaction details
    SELECT * INTO v_transaction_record 
    FROM active_transactions 
    WHERE transaction_id = p_transaction_id;

    IF NOT FOUND THEN
        RETURN QUERY SELECT 'error'::VARCHAR(20), 
                           'Transaction not found'::TEXT, 
                           0::INTEGER;
        RETURN;
    END IF;

    -- Check retry limits (basic logic)
    SELECT retry_count, max_retries INTO v_retry_count, v_max_retries
    FROM transaction_error_log
    WHERE transaction_id = p_transaction_id
    ORDER BY error_occurred_at DESC
    LIMIT 1;

    IF v_retry_count >= v_max_retries THEN
        RETURN QUERY SELECT 'failed'::VARCHAR(20), 
                           'Maximum retries exceeded'::TEXT, 
                           0::INTEGER;
        RETURN;
    END IF;

    -- Simple recovery strategies
    CASE p_recovery_strategy
        WHEN 'rollback' THEN
            BEGIN
                -- Attempt to rollback (very basic)
                UPDATE active_transactions 
                SET transaction_status = 'rolled_back',
                    updated_at = CURRENT_TIMESTAMP
                WHERE transaction_id = p_transaction_id;

                v_recovery_count := 1;

                RETURN QUERY SELECT 'success'::VARCHAR(20), 
                                   'Transaction rolled back'::TEXT, 
                                   v_recovery_count::INTEGER;
            EXCEPTION WHEN OTHERS THEN
                RETURN QUERY SELECT 'error'::VARCHAR(20), 
                                   SQLERRM::TEXT, 
                                   0::INTEGER;
            END;

        WHEN 'retry' THEN
            BEGIN
                -- Basic retry logic (very limited)
                UPDATE transaction_error_log 
                SET retry_count = retry_count + 1,
                    retry_successful = FALSE
                WHERE transaction_id = p_transaction_id;

                -- Reset transaction status for retry
                UPDATE active_transactions 
                SET transaction_status = 'active',
                    error_count = 0,
                    last_error_message = NULL,
                    updated_at = CURRENT_TIMESTAMP
                WHERE transaction_id = p_transaction_id;

                v_recovery_count := 1;

                RETURN QUERY SELECT 'retry'::VARCHAR(20), 
                                   'Transaction queued for retry'::TEXT, 
                                   v_recovery_count::INTEGER;
            EXCEPTION WHEN OTHERS THEN
                RETURN QUERY SELECT 'error'::VARCHAR(20), 
                                   SQLERRM::TEXT, 
                                   0::INTEGER;
            END;

        ELSE
            RETURN QUERY SELECT 'error'::VARCHAR(20), 
                               'Unknown recovery strategy'::TEXT, 
                               0::INTEGER;
    END CASE;
END;
$$ LANGUAGE plpgsql;

-- Basic transaction monitoring query (limited insights)
WITH transaction_health AS (
    SELECT 
        DATE_TRUNC('hour', start_time) as hour_bucket,

        -- Simple transaction metrics
        COUNT(*) as total_transactions,
        COUNT(CASE WHEN transaction_status = 'committed' THEN 1 END) as successful_transactions,
        COUNT(CASE WHEN transaction_status = 'rolled_back' THEN 1 END) as rolled_back_transactions,
        COUNT(CASE WHEN transaction_status = 'failed' THEN 1 END) as failed_transactions,
        COUNT(CASE WHEN transaction_status = 'active' AND 
                        EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - last_activity)) > timeout_seconds 
                  THEN 1 END) as timed_out_transactions,

        -- Basic performance metrics
        AVG(operations_count) as avg_operations_per_transaction,
        AVG(EXTRACT(EPOCH FROM (updated_at - start_time))) as avg_transaction_duration_seconds,

        -- Simple error analysis
        AVG(error_count) as avg_errors_per_transaction,
        COUNT(CASE WHEN error_count > 0 THEN 1 END) as transactions_with_errors

    FROM active_transactions
    WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY DATE_TRUNC('hour', start_time)
),

error_analysis AS (
    SELECT 
        DATE_TRUNC('hour', error_occurred_at) as hour_bucket,
        error_category,

        -- Error statistics
        COUNT(*) as error_count,
        COUNT(CASE WHEN retry_successful = TRUE THEN 1 END) as successful_retries,
        AVG(retry_count) as avg_retry_attempts,

        -- Common errors
        COUNT(CASE WHEN error_code LIKE 'SQLSTATE%' THEN 1 END) as sql_state_errors,
        COUNT(CASE WHEN error_message ILIKE '%timeout%' THEN 1 END) as timeout_errors,
        COUNT(CASE WHEN error_message ILIKE '%connection%' THEN 1 END) as connection_errors,
        COUNT(CASE WHEN error_message ILIKE '%deadlock%' THEN 1 END) as deadlock_errors

    FROM transaction_error_log
    WHERE error_occurred_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY DATE_TRUNC('hour', error_occurred_at), error_category
)

SELECT 
    th.hour_bucket,

    -- Transaction metrics
    th.total_transactions,
    th.successful_transactions,
    th.failed_transactions,
    ROUND((th.successful_transactions::DECIMAL / GREATEST(th.total_transactions, 1)) * 100, 2) as success_rate_percent,

    -- Performance metrics
    ROUND(th.avg_transaction_duration_seconds, 3) as avg_duration_seconds,
    ROUND(th.avg_operations_per_transaction, 1) as avg_operations,

    -- Error metrics
    COALESCE(SUM(ea.error_count), 0) as total_errors,
    COALESCE(SUM(ea.successful_retries), 0) as successful_retries,
    COALESCE(ROUND(AVG(ea.avg_retry_attempts), 1), 0) as avg_retry_attempts,

    -- Error categories
    COALESCE(SUM(ea.timeout_errors), 0) as timeout_errors,
    COALESCE(SUM(ea.connection_errors), 0) as connection_errors,
    COALESCE(SUM(ea.deadlock_errors), 0) as deadlock_errors,

    -- Health indicators
    th.timed_out_transactions,
    CASE 
        WHEN ROUND((th.successful_transactions::DECIMAL / GREATEST(th.total_transactions, 1)) * 100, 2) >= 95 THEN 'Healthy'
        WHEN ROUND((th.successful_transactions::DECIMAL / GREATEST(th.total_transactions, 1)) * 100, 2) >= 90 THEN 'Warning'
        ELSE 'Critical'
    END as health_status

FROM transaction_health th
LEFT JOIN error_analysis ea ON th.hour_bucket = ea.hour_bucket
GROUP BY th.hour_bucket, th.total_transactions, th.successful_transactions, 
         th.failed_transactions, th.avg_transaction_duration_seconds, 
         th.avg_operations_per_transaction, th.timed_out_transactions
ORDER BY th.hour_bucket DESC;

-- Problems with traditional transaction error handling:
-- 1. Basic error categorization with limited diagnostic information
-- 2. Manual retry logic without intelligent backoff strategies
-- 3. No automatic recovery based on error type and context
-- 4. Limited visibility into transaction state and progress
-- 5. Basic timeout handling without consideration of operation complexity
-- 6. No integration with connection pool health and server status
-- 7. Manual intervention required for most recovery scenarios
-- 8. Limited support for distributed transaction patterns
-- 9. Basic error aggregation without trend analysis
-- 10. No automatic optimization based on error patterns

MongoDB's intelligent transaction error handling, applied through application-side patterns like the transaction manager below, addresses these limitations:

// MongoDB advanced transaction error handling - intelligent and resilient
const { MongoClient } = require('mongodb');

// Comprehensive transaction error handling and recovery system
class MongoTransactionManager {
  constructor(client, options = {}) {
    this.client = client;
    this.options = {
      // Retry configuration
      maxRetryAttempts: options.maxRetryAttempts || 5,
      initialRetryDelayMs: options.initialRetryDelayMs || 100,
      maxRetryDelayMs: options.maxRetryDelayMs || 5000,
      retryDelayMultiplier: options.retryDelayMultiplier || 2,
      jitterFactor: options.jitterFactor || 0.1,

      // Transaction configuration
      defaultTransactionOptions: {
        readConcern: { level: options.readConcernLevel || 'snapshot' },
        writeConcern: { w: options.writeConcernW || 'majority', j: true },
        readPreference: options.readPreference || 'primary',
        maxCommitTimeMS: options.maxCommitTimeMS || 10000
      },

      // Error handling configuration
      retryableErrorCodes: options.retryableErrorCodes || [
        112, // WriteConflict
        117, // ConflictingOperationInProgress  
        133, // FailedToSatisfyReadPreference
        134, // ReadConcernMajorityNotAvailableYet
        208, // ExceededTimeLimit
        225, // LockTimeout
        244, // TransactionTooLarge
        251, // NoSuchTransaction
        256, // TransactionAborted
        261, // ExceededMaxTimeMS
        263, // TemporarilyUnavailable
        6   // HostUnreachable
      ],

      // Monitoring configuration
      // Default to enabled unless explicitly set to false
      enableDetailedLogging: options.enableDetailedLogging !== false,
      enableMetricsCollection: options.enableMetricsCollection !== false
    };

    this.transactionMetrics = {
      totalTransactions: 0,
      successfulTransactions: 0,
      failedTransactions: 0,
      retriedTransactions: 0,
      totalRetryAttempts: 0,
      errorsByCode: new Map(),
      errorsByCategory: new Map(),
      performanceStats: {
        averageTransactionDuration: 0,
        transactionDurations: [],
        retryDelays: [],
        averageRetryDelay: 0
      }
    };

    this.activeTransactions = new Map();
  }

  // Execute transaction with comprehensive error handling and retry logic
  async executeTransactionWithRetry(transactionFunction, transactionOptions = {}) {
    const transactionId = this.generateTransactionId();
    const startTime = Date.now();

    // Merge transaction options
    const mergedOptions = {
      ...this.options.defaultTransactionOptions,
      ...transactionOptions
    };

    let attempt = 1;
    let lastError = null;
    let session = null;

    // Track active transaction
    this.activeTransactions.set(transactionId, {
      id: transactionId,
      startTime: startTime,
      attempt: attempt,
      status: 'active',
      operationsExecuted: 0,
      errors: []
    });

    try {
      while (attempt <= this.options.maxRetryAttempts) {
        try {
          // Create new session for each attempt
          session = this.client.startSession();

          this.log(`Starting transaction ${transactionId}, attempt ${attempt}`);

          // Update transaction tracking
          this.updateTransactionStatus(transactionId, 'active', { attempt });

          // Execute transaction with intelligent error handling
          const result = await session.withTransaction(
            async (sessionContext) => {
              try {
                // Execute the user-provided transaction function
                const transactionResult = await transactionFunction(sessionContext, {
                  transactionId,
                  attempt,
                  onOperation: (operation) => this.trackOperation(transactionId, operation)
                });

                this.log(`Transaction ${transactionId} executed successfully on attempt ${attempt}`);
                return transactionResult;

              } catch (error) {
                this.log(`Transaction ${transactionId} error in user function:`, error);
                throw error;
              }
            },
            mergedOptions
          );

          // Transaction successful
          const duration = Date.now() - startTime;

          this.updateTransactionStatus(transactionId, 'committed', { 
            duration,
            totalAttempts: attempt 
          });

          this.recordSuccessfulTransaction(transactionId, duration, attempt);

          this.log(`Transaction ${transactionId} committed successfully after ${attempt} attempts (${duration}ms)`);

          return {
            success: true,
            result: result,
            transactionId: transactionId,
            attempts: attempt,
            duration: duration,
            metrics: this.getTransactionMetrics(transactionId)
          };

        } catch (error) {
          lastError = error;

          this.log(`Transaction ${transactionId} attempt ${attempt} failed:`, error);

          // Record error for analysis
          this.recordTransactionError(transactionId, error, attempt);

          // Analyze error and determine if retry is appropriate
          const errorAnalysis = this.analyzeTransactionError(error);

          if (!errorAnalysis.retryable || attempt >= this.options.maxRetryAttempts) {
            // Error is not retryable or max attempts reached
            this.updateTransactionStatus(transactionId, 'failed', { 
              finalError: error,
              totalAttempts: attempt,
              errorAnalysis 
            });

            break;
          }

          // Calculate intelligent retry delay
          const retryDelay = this.calculateRetryDelay(attempt, errorAnalysis);

          this.log(`Transaction ${transactionId} will retry in ${retryDelay}ms (attempt ${attempt + 1}/${this.options.maxRetryAttempts})`);

          // Update metrics
          this.transactionMetrics.totalRetryAttempts++;
          this.transactionMetrics.performanceStats.retryDelays.push(retryDelay);

          // Wait before retry
          if (retryDelay > 0) {
            await this.sleep(retryDelay);
          }

          attempt++;

        } finally {
          // Always close session
          if (session) {
            try {
              await session.endSession();
            } catch (sessionError) {
              this.log(`Error ending session for transaction ${transactionId}:`, sessionError);
            }
          }
        }
      }

      // All retries exhausted
      const totalDuration = Date.now() - startTime;

      // The loop always exits via break, so 'attempt' equals the attempts actually made
      this.recordFailedTransaction(transactionId, lastError, attempt, totalDuration);

      this.log(`Transaction ${transactionId} failed after ${attempt} attempts (${totalDuration}ms)`);

      return {
        success: false,
        error: lastError,
        transactionId: transactionId,
        attempts: attempt,
        duration: totalDuration,
        errorAnalysis: this.analyzeTransactionError(lastError),
        metrics: this.getTransactionMetrics(transactionId),
        recoveryRecommendations: this.generateRecoveryRecommendations(transactionId, lastError)
      };

    } finally {
      // Clean up transaction tracking
      this.activeTransactions.delete(transactionId);
    }
  }

  // Intelligent error analysis for MongoDB transactions
  analyzeTransactionError(error) {
    const analysis = {
      errorCode: error.code,
      errorMessage: error.message,
      errorName: error.name,
      retryable: false,
      category: 'unknown',
      severity: 'medium',
      recommendedAction: 'investigate',
      estimatedRecoveryTime: 0,
      contextualInfo: {}
    };

    // Categorize error based on code and message
    if (error.code) {
      // Transient errors that should be retried
      if (this.options.retryableErrorCodes.includes(error.code)) {
        analysis.retryable = true;
        analysis.category = this.categorizeMongoError(error.code);
        analysis.severity = 'low';
        analysis.recommendedAction = 'retry';
        analysis.estimatedRecoveryTime = this.estimateRecoveryTime(error.code);
      }

      // Specific error code analysis
      switch (error.code) {
        case 112: // WriteConflict
          analysis.category = 'concurrency';
          analysis.recommendedAction = 'retry_with_backoff';
          analysis.contextualInfo.suggestion = 'Consider optimizing transaction scope to reduce conflicts';
          break;

        case 117: // ConflictingOperationInProgress
          analysis.category = 'concurrency';
          analysis.recommendedAction = 'retry_with_longer_delay';
          analysis.contextualInfo.suggestion = 'Wait for conflicting operation to complete';
          break;

        case 133: // FailedToSatisfyReadPreference
          analysis.category = 'availability';
          analysis.recommendedAction = 'check_replica_set_status';
          analysis.contextualInfo.suggestion = 'Verify replica set member availability';
          break;

        case 208: // ExceededTimeLimit
        case 261: // ExceededMaxTimeMS
          analysis.category = 'timeout';
          analysis.recommendedAction = 'optimize_or_increase_timeout';
          analysis.contextualInfo.suggestion = 'Consider breaking transaction into smaller operations';
          break;

        case 244: // TransactionTooLarge
          analysis.category = 'resource';
          analysis.retryable = false;
          analysis.severity = 'high';
          analysis.recommendedAction = 'reduce_transaction_size';
          analysis.contextualInfo.suggestion = 'Split transaction into smaller operations';
          break;

        case 251: // NoSuchTransaction
          analysis.category = 'state';
          analysis.recommendedAction = 'restart_transaction';
          analysis.contextualInfo.suggestion = 'Transaction may have been cleaned up by server';
          break;

        case 256: // TransactionAborted
          analysis.category = 'aborted';
          analysis.recommendedAction = 'retry_full_transaction';
          analysis.contextualInfo.suggestion = 'Transaction was aborted due to conflict or timeout';
          break;
      }
    }

    // Network-related errors
    if (error.message && (
      error.message.includes('network') || 
      error.message.includes('connection') ||
      error.message.includes('timeout') ||
      error.message.includes('unreachable')
    )) {
      analysis.retryable = true;
      analysis.category = 'network';
      analysis.recommendedAction = 'retry_with_exponential_backoff';
      analysis.estimatedRecoveryTime = 5000; // 5 seconds
      analysis.contextualInfo.suggestion = 'Check network connectivity and server status';
    }

    // Resource exhaustion errors
    if (error.message && (
      error.message.includes('memory') ||
      error.message.includes('disk space') ||
      error.message.includes('too many connections')
    )) {
      analysis.retryable = true;
      analysis.category = 'resource';
      analysis.severity = 'high';
      analysis.recommendedAction = 'wait_for_resources';
      analysis.estimatedRecoveryTime = 10000; // 10 seconds
      analysis.contextualInfo.suggestion = 'Monitor server resource usage';
    }

    return analysis;
  }

  categorizeMongoError(errorCode) {
    const errorCategories = {
      112: 'concurrency',    // WriteConflict
      117: 'concurrency',    // ConflictingOperationInProgress
      133: 'availability',   // FailedToSatisfyReadPreference
      134: 'availability',   // ReadConcernMajorityNotAvailableYet
      208: 'timeout',        // ExceededTimeLimit
      225: 'concurrency',    // LockTimeout
      244: 'resource',       // TransactionTooLarge
      251: 'state',          // NoSuchTransaction
      256: 'aborted',        // TransactionAborted
      261: 'timeout',        // ExceededMaxTimeMS
      263: 'availability',   // TemporarilyUnavailable
      6: 'network'           // HostUnreachable
    };

    return errorCategories[errorCode] || 'unknown';
  }

  estimateRecoveryTime(errorCode) {
    const recoveryTimes = {
      112: 100,   // WriteConflict - quick retry
      117: 500,   // ConflictingOperationInProgress - wait for operation
      133: 2000,  // FailedToSatisfyReadPreference - wait for replica
      134: 1000,  // ReadConcernMajorityNotAvailableYet - wait for majority
      208: 5000,  // ExceededTimeLimit - wait before retry
      225: 200,   // LockTimeout - quick retry
      251: 100,   // NoSuchTransaction - immediate retry
      256: 300,   // TransactionAborted - short wait
      261: 3000,  // ExceededMaxTimeMS - moderate wait
      263: 1000,  // TemporarilyUnavailable - short wait
      6: 5000     // HostUnreachable - wait for network
    };

    return recoveryTimes[errorCode] || 1000;
  }

  // Calculate intelligent retry delay with exponential backoff and jitter
  calculateRetryDelay(attemptNumber, errorAnalysis) {
    // Base delay calculation with exponential backoff
    let baseDelay = Math.min(
      this.options.initialRetryDelayMs * Math.pow(this.options.retryDelayMultiplier, attemptNumber - 1),
      this.options.maxRetryDelayMs
    );

    // Adjust based on error analysis
    if (errorAnalysis.estimatedRecoveryTime > 0) {
      baseDelay = Math.max(baseDelay, errorAnalysis.estimatedRecoveryTime);
    }

    // Add jitter to prevent thundering herd
    const jitterRange = baseDelay * this.options.jitterFactor;
    const jitter = (Math.random() * 2 - 1) * jitterRange; // Random value between -jitterRange and +jitterRange

    const finalDelay = Math.max(0, Math.floor(baseDelay + jitter));

    this.log(`Calculated retry delay: base=${baseDelay}ms, jitter=${jitter.toFixed(1)}ms, final=${finalDelay}ms`);

    return finalDelay;
  }

  // Generate recovery recommendations based on error patterns
  generateRecoveryRecommendations(transactionId, error) {
    const recommendations = [];
    const errorAnalysis = this.analyzeTransactionError(error);

    // Category-specific recommendations
    switch (errorAnalysis.category) {
      case 'concurrency':
        recommendations.push({
          type: 'optimization',
          priority: 'medium',
          description: 'Optimize transaction scope to reduce write conflicts',
          actions: [
            'Consider breaking large transactions into smaller operations',
            'Review document access patterns for optimization opportunities',
            'Implement optimistic locking where appropriate'
          ]
        });
        break;

      case 'timeout':
        recommendations.push({
          type: 'configuration',
          priority: 'high',
          description: 'Address transaction timeout issues',
          actions: [
            'Increase maxCommitTimeMS if operations are legitimately slow',
            'Optimize query performance with proper indexing',
            'Consider breaking complex operations into smaller transactions'
          ]
        });
        break;

      case 'resource':
        recommendations.push({
          type: 'scaling',
          priority: 'high',
          description: 'Address resource constraints',
          actions: [
            'Monitor server resource usage (CPU, memory, disk)',
            'Consider vertical or horizontal scaling',
            'Implement connection pooling optimization'
          ]
        });
        break;

      case 'network':
        recommendations.push({
          type: 'infrastructure',
          priority: 'high',
          description: 'Address network connectivity issues',
          actions: [
            'Check network connectivity between application and database',
            'Verify MongoDB server status and availability',
            'Consider implementing circuit breaker pattern'
          ]
        });
        break;

      case 'availability':
        recommendations.push({
          type: 'deployment',
          priority: 'high',
          description: 'Address replica set availability',
          actions: [
            'Check replica set member status',
            'Verify read preference configuration',
            'Monitor replica lag and catch-up status'
          ]
        });
        break;
    }

    // Pattern-based recommendations
    const transactionHistory = this.getTransactionHistory(transactionId);
    if (transactionHistory && transactionHistory.errors.length > 1) {
      // Check for recurring error patterns
      const errorCodes = transactionHistory.errors.map(e => e.code);
      const uniqueErrorCodes = [...new Set(errorCodes)];

      if (uniqueErrorCodes.length === 1) {
        recommendations.push({
          type: 'pattern',
          priority: 'high',
          description: 'Recurring error pattern detected',
          actions: [
            `Address root cause of error ${uniqueErrorCodes[0]}`,
            'Consider implementing circuit breaker pattern',
            'Review application architecture for reliability improvements'
          ]
        });
      }
    }

    return recommendations;
  }

  // Advanced transaction monitoring and metrics collection
  recordSuccessfulTransaction(transactionId, duration, attempts) {
    this.transactionMetrics.totalTransactions++;
    this.transactionMetrics.successfulTransactions++;

    if (attempts > 1) {
      this.transactionMetrics.retriedTransactions++;
    }

    // Update performance statistics
    this.transactionMetrics.performanceStats.transactionDurations.push(duration);

    // Keep only recent durations for average calculation
    if (this.transactionMetrics.performanceStats.transactionDurations.length > 1000) {
      this.transactionMetrics.performanceStats.transactionDurations = 
        this.transactionMetrics.performanceStats.transactionDurations.slice(-500);
    }

    // Recalculate average
    this.transactionMetrics.performanceStats.averageTransactionDuration = 
      this.transactionMetrics.performanceStats.transactionDurations.reduce((sum, d) => sum + d, 0) /
      this.transactionMetrics.performanceStats.transactionDurations.length;

    this.log(`Transaction ${transactionId} metrics recorded: duration=${duration}ms, attempts=${attempts}`);
  }

  recordFailedTransaction(transactionId, error, attempts, duration) {
    this.transactionMetrics.totalTransactions++;
    this.transactionMetrics.failedTransactions++;

    if (attempts > 1) {
      this.transactionMetrics.retriedTransactions++;
    }

    // Record error statistics
    const errorCode = error.code || 'unknown';
    const currentCount = this.transactionMetrics.errorsByCode.get(errorCode) || 0;
    this.transactionMetrics.errorsByCode.set(errorCode, currentCount + 1);

    const errorCategory = this.categorizeMongoError(error.code);
    const currentCategoryCount = this.transactionMetrics.errorsByCategory.get(errorCategory) || 0;
    this.transactionMetrics.errorsByCategory.set(errorCategory, currentCategoryCount + 1);

    this.log(`Transaction ${transactionId} failure recorded: error=${errorCode}, attempts=${attempts}, duration=${duration}ms`);
  }

  recordTransactionError(transactionId, error, attempt) {
    const transaction = this.activeTransactions.get(transactionId);
    if (transaction) {
      transaction.errors.push({
        attempt: attempt,
        error: error,
        timestamp: new Date(),
        errorCode: error.code,
        errorMessage: error.message,
        analysis: this.analyzeTransactionError(error)
      });
    }
  }

  updateTransactionStatus(transactionId, status, additionalInfo = {}) {
    const transaction = this.activeTransactions.get(transactionId);
    if (transaction) {
      transaction.status = status;
      transaction.lastUpdated = new Date();
      Object.assign(transaction, additionalInfo);
    }
  }

  trackOperation(transactionId, operation) {
    const transaction = this.activeTransactions.get(transactionId);
    if (transaction) {
      transaction.operationsExecuted++;
      transaction.lastOperation = {
        type: operation.type,
        collection: operation.collection,
        timestamp: new Date()
      };
    }
  }

  getTransactionMetrics(transactionId) {
    const transaction = this.activeTransactions.get(transactionId);
    return {
      transactionId: transactionId,
      operationsExecuted: transaction ? transaction.operationsExecuted : 0,
      errors: transaction ? transaction.errors : [],
      status: transaction ? transaction.status : 'unknown',
      startTime: transaction ? transaction.startTime : null,
      duration: transaction ? Date.now() - transaction.startTime : 0
    };
  }

  getTransactionHistory(transactionId) {
    return this.activeTransactions.get(transactionId);
  }

  // Comprehensive transaction health monitoring
  getTransactionHealthReport() {
    const report = {
      timestamp: new Date(),
      overall: {
        totalTransactions: this.transactionMetrics.totalTransactions,
        successfulTransactions: this.transactionMetrics.successfulTransactions,
        failedTransactions: this.transactionMetrics.failedTransactions,
        retriedTransactions: this.transactionMetrics.retriedTransactions,
        totalRetryAttempts: this.transactionMetrics.totalRetryAttempts,
        successRate: this.transactionMetrics.totalTransactions > 0 ? 
          (this.transactionMetrics.successfulTransactions / this.transactionMetrics.totalTransactions) * 100 : 0,
        retryRate: this.transactionMetrics.totalTransactions > 0 ?
          (this.transactionMetrics.retriedTransactions / this.transactionMetrics.totalTransactions) * 100 : 0
      },
      performance: {
        averageTransactionDuration: this.transactionMetrics.performanceStats.averageTransactionDuration,
        averageRetryDelay: this.transactionMetrics.performanceStats.retryDelays.length > 0 ?
          this.transactionMetrics.performanceStats.retryDelays.reduce((sum, d) => sum + d, 0) /
          this.transactionMetrics.performanceStats.retryDelays.length : 0,
        totalRecentTransactions: this.transactionMetrics.performanceStats.transactionDurations.length
      },
      errors: {
        byCode: Object.fromEntries(this.transactionMetrics.errorsByCode),
        byCategory: Object.fromEntries(this.transactionMetrics.errorsByCategory),
        mostCommonError: this.getMostCommonError(),
        mostCommonCategory: this.getMostCommonErrorCategory()
      },
      activeTransactions: {
        count: this.activeTransactions.size,
        transactions: Array.from(this.activeTransactions.values()).map(t => ({
          id: t.id,
          status: t.status,
          duration: Date.now() - t.startTime,
          attempts: t.attempt,
          operationsExecuted: t.operationsExecuted,
          errorCount: t.errors ? t.errors.length : 0
        }))
      }
    };

    return report;
  }

  getMostCommonError() {
    let maxCount = 0;
    let mostCommonError = null;

    for (const [errorCode, count] of this.transactionMetrics.errorsByCode.entries()) {
      if (count > maxCount) {
        maxCount = count;
        mostCommonError = { code: errorCode, count: count };
      }
    }

    return mostCommonError;
  }

  getMostCommonErrorCategory() {
    let maxCount = 0;
    let mostCommonCategory = null;

    for (const [category, count] of this.transactionMetrics.errorsByCategory.entries()) {
      if (count > maxCount) {
        maxCount = count;
        mostCommonCategory = { category: category, count: count };
      }
    }

    return mostCommonCategory;
  }

  // Utility methods
  generateTransactionId() {
    return `txn_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  log(message, error = null) {
    if (this.options.enableDetailedLogging) {
      const timestamp = new Date().toISOString();
      if (error) {
        console.log(`[${timestamp}] ${message}`, error);
      } else {
        console.log(`[${timestamp}] ${message}`);
      }
    }
  }
}

// Example usage with comprehensive error handling
async function demonstrateTransactionErrorHandling() {
  // Note: multi-document transactions require a replica set or sharded cluster deployment
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const transactionManager = new MongoTransactionManager(client, {
    maxRetryAttempts: 3,
    initialRetryDelayMs: 100,
    maxRetryDelayMs: 5000,
    enableDetailedLogging: true,
    enableMetricsCollection: true
  });

  try {
    // Example transaction with comprehensive error handling
    const result = await transactionManager.executeTransactionWithRetry(
      async (session, context) => {
        const { transactionId, attempt } = context;

        console.log(`Executing business logic for transaction ${transactionId}, attempt ${attempt}`);

        const db = client.db('ecommerce');
        const ordersCollection = db.collection('orders');
        const inventoryCollection = db.collection('inventory');
        const accountsCollection = db.collection('accounts');

        // Track operations for monitoring
        context.onOperation({ type: 'insert', collection: 'orders' });
        context.onOperation({ type: 'update', collection: 'inventory' });
        context.onOperation({ type: 'update', collection: 'accounts' });

        // Complex business transaction
        const order = {
          orderId: `order_${Date.now()}`,
          customerId: 'customer_123',
          items: [
            { productId: 'prod_456', quantity: 2, price: 29.99 },
            { productId: 'prod_789', quantity: 1, price: 49.99 }
          ],
          totalAmount: 109.97,
          status: 'pending',
          createdAt: new Date()
        };

        // Insert order
        const orderResult = await ordersCollection.insertOne(order, { session });

        // Update inventory
        for (const item of order.items) {
          const inventoryUpdate = await inventoryCollection.updateOne(
            { productId: item.productId, quantity: { $gte: item.quantity } },
            { $inc: { quantity: -item.quantity } },
            { session }
          );

          if (inventoryUpdate.modifiedCount === 0) {
            throw new Error(`Insufficient inventory for product ${item.productId}`);
          }
        }

        // Update customer account
        await accountsCollection.updateOne(
          { customerId: order.customerId },
          { 
            $inc: { totalOrders: 1, totalSpent: order.totalAmount },
            $set: { lastOrderDate: new Date() }
          },
          { session }
        );

        return {
          orderId: order.orderId,
          orderResult: orderResult,
          message: 'Order processed successfully'
        };
      },
      {
        // Custom transaction options
        maxCommitTimeMS: 15000,
        readConcern: { level: 'snapshot' },
        writeConcern: { w: 'majority', j: true }
      }
    );

    if (result.success) {
      console.log('Transaction completed successfully:', result);
    } else {
      console.error('Transaction failed after all retries:', result);
    }

    // Get comprehensive health report
    const healthReport = transactionManager.getTransactionHealthReport();
    console.log('Transaction Health Report:', JSON.stringify(healthReport, null, 2));

  } catch (error) {
    console.error('Unexpected error:', error);
  } finally {
    await client.close();
  }
}

// Benefits of MongoDB intelligent transaction error handling:
// - Automatic retry logic with exponential backoff and jitter
// - Intelligent error classification and recovery recommendations
// - Comprehensive transaction state tracking and monitoring
// - Advanced performance metrics and health reporting
// - Context-aware error analysis and recovery strategies
// - Built-in support for MongoDB-specific error patterns
// - Detailed logging and diagnostic information
// - Integration with MongoDB driver optimization features
// - Automatic detection of retryable vs. non-retryable errors
// - Production-ready resilience and reliability patterns
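
The metrics the manager collects can be surfaced with a small reporting loop. The sketch below assumes a transactionManager instance created as in the example above; the one-minute interval and the 95% threshold are illustrative values, not recommendations.

// Periodically log the manager's health report and flag low success rates.
// Assumes 'transactionManager' is a MongoTransactionManager instance from above.
function startTransactionHealthReporting(transactionManager, intervalMs = 60000) {
  return setInterval(() => {
    const report = transactionManager.getTransactionHealthReport();
    const { overall, performance, errors } = report;

    console.log(
      `[tx-health] total=${overall.totalTransactions} ` +
      `success=${overall.successRate.toFixed(1)}% ` +
      `retryRate=${overall.retryRate.toFixed(1)}% ` +
      `avgDuration=${Math.round(performance.averageTransactionDuration)}ms`
    );

    // Illustrative alert threshold; tune to your own SLOs
    if (overall.totalTransactions > 0 && overall.successRate < 95) {
      console.warn('[tx-health] success rate below 95%, most common error:', errors.mostCommonError);
    }
  }, intervalMs);
}

// const healthTimer = startTransactionHealthReporting(transactionManager);
// ...later, on shutdown: clearInterval(healthTimer);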

Advanced Error Recovery Patterns

Sophisticated recovery strategies for production-grade MongoDB applications:

// Advanced MongoDB error recovery patterns for enterprise resilience
class MongoResilienceManager {
  constructor(client, options = {}) {
    this.client = client;
    this.transactionManager = new MongoTransactionManager(client, options);

    this.recoveryStrategies = new Map();
    this.circuitBreakers = new Map();
    this.healthCheckers = new Map();

    this.options = {
      // Circuit breaker configuration
      circuitBreakerThreshold: options.circuitBreakerThreshold || 5,
      circuitBreakerTimeout: options.circuitBreakerTimeout || 60000,
      circuitBreakerVolumeThreshold: options.circuitBreakerVolumeThreshold || 10,

      // Health check configuration
      healthCheckInterval: options.healthCheckInterval || 30000,
      healthCheckTimeout: options.healthCheckTimeout || 5000,

      // Recovery configuration
      // Default to enabled unless explicitly set to false
      enableAutomaticRecovery: options.enableAutomaticRecovery !== false,
      maxRecoveryAttempts: options.maxRecoveryAttempts || 3
    };

    this.initialize();
  }

  initialize() {
    // Set up circuit breakers for different operation types
    this.setupCircuitBreakers();

    // Initialize health monitoring
    this.startHealthMonitoring();

    // Register recovery strategies
    this.registerRecoveryStrategies();
  }

  setupCircuitBreakers() {
    const operationTypes = ['transaction', 'query', 'update', 'insert', 'delete'];

    operationTypes.forEach(opType => {
      this.circuitBreakers.set(opType, {
        state: 'closed', // closed, open, half-open
        failureCount: 0,
        lastFailureTime: null,
        successCount: 0,
        totalRequests: 0,
        threshold: this.options.circuitBreakerThreshold,
        timeout: this.options.circuitBreakerTimeout,
        volumeThreshold: this.options.circuitBreakerVolumeThreshold
      });
    });
  }

  // Execute operation with circuit breaker protection
  async executeWithCircuitBreaker(operationType, operation) {
    const circuitBreaker = this.circuitBreakers.get(operationType);

    if (!circuitBreaker) {
      throw new Error(`No circuit breaker configured for operation type: ${operationType}`);
    }

    // Check circuit breaker state
    const canExecute = this.checkCircuitBreaker(circuitBreaker);

    if (!canExecute) {
      throw new Error(`Circuit breaker is OPEN for ${operationType}. Service temporarily unavailable.`);
    }

    try {
      // Execute operation
      const result = await operation();

      // Record success
      this.recordCircuitBreakerSuccess(circuitBreaker);

      return result;

    } catch (error) {
      // Record failure
      this.recordCircuitBreakerFailure(circuitBreaker);

      throw error;
    }
  }

  checkCircuitBreaker(circuitBreaker) {
    const now = Date.now();

    switch (circuitBreaker.state) {
      case 'closed':
        return true;

      case 'open':
        // Check if timeout has elapsed
        if (now - circuitBreaker.lastFailureTime >= circuitBreaker.timeout) {
          circuitBreaker.state = 'half-open';
          return true;
        }
        return false;

      case 'half-open':
        return true;

      default:
        return false;
    }
  }

  recordCircuitBreakerSuccess(circuitBreaker) {
    circuitBreaker.successCount++;
    circuitBreaker.totalRequests++;

    if (circuitBreaker.state === 'half-open') {
      // Reset circuit breaker on successful half-open request
      circuitBreaker.state = 'closed';
      circuitBreaker.failureCount = 0;
    }
  }

  recordCircuitBreakerFailure(circuitBreaker) {
    circuitBreaker.failureCount++;
    circuitBreaker.totalRequests++;
    circuitBreaker.lastFailureTime = Date.now();

    // Check if should open circuit
    if (circuitBreaker.totalRequests >= circuitBreaker.volumeThreshold &&
        circuitBreaker.failureCount >= circuitBreaker.threshold) {
      circuitBreaker.state = 'open';
      console.log(`Circuit breaker opened due to ${circuitBreaker.failureCount} failures`);
    }
  }

  // Comprehensive transaction execution with full resilience features
  async executeResilientTransaction(transactionFunction, options = {}) {
    const operationType = 'transaction';

    return await this.executeWithCircuitBreaker(operationType, async () => {
      // Execute transaction with comprehensive error handling
      const result = await this.transactionManager.executeTransactionWithRetry(
        transactionFunction,
        options
      );

      // If transaction failed, attempt recovery if enabled
      if (!result.success && this.options.enableAutomaticRecovery) {
        const recoveryResult = await this.attemptTransactionRecovery(result);
        if (recoveryResult && recoveryResult.success) {
          return recoveryResult;
        }
      }

      return result;
    });
  }

  // Intelligent transaction recovery based on error patterns
  async attemptTransactionRecovery(failedResult) {
    const { error, transactionId, attempts, errorAnalysis } = failedResult;

    console.log(`Attempting recovery for failed transaction ${transactionId}`);

    // Get appropriate recovery strategy
    const recoveryStrategy = this.getRecoveryStrategy(errorAnalysis);

    if (!recoveryStrategy) {
      console.log(`No recovery strategy available for error category: ${errorAnalysis.category}`);
      return null;
    }

    try {
      const recoveryResult = await recoveryStrategy.execute(failedResult);

      console.log(`Recovery attempt completed for transaction ${transactionId}:`, recoveryResult);

      return recoveryResult;

    } catch (recoveryError) {
      console.error(`Recovery failed for transaction ${transactionId}:`, recoveryError);
      return null;
    }
  }

  registerRecoveryStrategies() {
    // Network connectivity recovery
    this.recoveryStrategies.set('network', {
      execute: async (failedResult) => {
        console.log('Executing network recovery strategy');

        // Wait for network to recover
        await this.waitForNetworkRecovery();

        // Check server connectivity
        const healthOk = await this.performHealthCheck();

        if (healthOk) {
          console.log('Network recovery successful, retrying transaction');
          // Could retry the transaction here if the original function is available
          return { success: true, recovered: true, strategy: 'network' };
        }

        return { success: false, recovered: false, strategy: 'network' };
      }
    });

    // Resource recovery
    this.recoveryStrategies.set('resource', {
      execute: async (failedResult) => {
        console.log('Executing resource recovery strategy');

        // Wait for resources to become available
        await this.waitForResourceAvailability();

        // Check resource status
        const resourcesOk = await this.checkResourceStatus();

        if (resourcesOk) {
          console.log('Resource recovery successful');
          return { success: true, recovered: true, strategy: 'resource' };
        }

        return { success: false, recovered: false, strategy: 'resource' };
      }
    });

    // Availability recovery (replica set issues)
    this.recoveryStrategies.set('availability', {
      execute: async (failedResult) => {
        console.log('Executing availability recovery strategy');

        // Check replica set status
        const replicaSetOk = await this.checkReplicaSetHealth();

        if (replicaSetOk) {
          console.log('Availability recovery successful');
          return { success: true, recovered: true, strategy: 'availability' };
        }

        // Wait for replica set to recover
        await this.waitForReplicaSetRecovery();

        const recoveredReplicaSetOk = await this.checkReplicaSetHealth();

        return {
          success: recoveredReplicaSetOk,
          recovered: recoveredReplicaSetOk,
          strategy: 'availability'
        };
      }
    });
  }

  getRecoveryStrategy(errorAnalysis) {
    return this.recoveryStrategies.get(errorAnalysis.category);
  }

  // Health monitoring and recovery assistance
  startHealthMonitoring() {
    setInterval(async () => {
      try {
        await this.performComprehensiveHealthCheck();
      } catch (error) {
        console.error('Health monitoring error:', error);
      }
    }, this.options.healthCheckInterval);
  }

  async performComprehensiveHealthCheck() {
    const healthStatus = {
      timestamp: new Date(),
      overall: 'unknown',
      components: {}
    };

    try {
      // Check basic connectivity
      healthStatus.components.connectivity = await this.checkConnectivity();

      // Check replica set status
      healthStatus.components.replicaSet = await this.checkReplicaSetHealth();

      // Check resource status
      healthStatus.components.resources = await this.checkResourceStatus();

      // Check circuit breaker status
      healthStatus.components.circuitBreakers = this.getCircuitBreakerStatus();

      // Check transaction manager health
      healthStatus.components.transactionManager = this.transactionManager.getTransactionHealthReport();

      // Determine overall health
      const componentStatuses = Object.values(healthStatus.components);
      const healthyComponents = componentStatuses.filter(status => 
        status === true || (typeof status === 'object' && status.healthy !== false)
      );

      if (healthyComponents.length === componentStatuses.length) {
        healthStatus.overall = 'healthy';
      } else if (healthyComponents.length >= componentStatuses.length * 0.7) {
        healthStatus.overall = 'degraded';
      } else {
        healthStatus.overall = 'unhealthy';
      }

      // Store health status
      this.lastHealthStatus = healthStatus;

      return healthStatus;

    } catch (error) {
      healthStatus.overall = 'error';
      healthStatus.error = error.message;
      return healthStatus;
    }
  }

  async checkConnectivity() {
    try {
      const admin = this.client.db('admin');
      await admin.command({ ping: 1 }, { maxTimeMS: this.options.healthCheckTimeout });
      return true;
    } catch (error) {
      return false;
    }
  }

  async checkReplicaSetHealth() {
    try {
      const admin = this.client.db('admin');
      const status = await admin.command({ replSetGetStatus: 1 });

      // Check if majority of members are healthy
      const healthyMembers = status.members.filter(member => 
        member.health === 1 && ['PRIMARY', 'SECONDARY'].includes(member.stateStr)
      );

      return {
        healthy: healthyMembers.length >= Math.floor(status.members.length / 2) + 1,
        totalMembers: status.members.length,
        healthyMembers: healthyMembers.length,
        primaryAvailable: status.members.some(m => m.stateStr === 'PRIMARY')
      };

    } catch (error) {
      // Might not be a replica set or insufficient privileges
      return { healthy: true, note: 'Replica set status unavailable' };
    }
  }

  async checkResourceStatus() {
    try {
      const admin = this.client.db('admin');
      const serverStatus = await admin.command({ serverStatus: 1 });

      const memUsage = serverStatus.mem.resident / serverStatus.mem.virtual;
      const connectionUsage = serverStatus.connections.current / serverStatus.connections.available;

      return {
        healthy: memUsage < 0.9 && connectionUsage < 0.9,
        memoryUsage: memUsage,
        connectionUsage: connectionUsage,
        connections: serverStatus.connections,
        memory: serverStatus.mem
      };

    } catch (error) {
      return { healthy: false, error: error.message };
    }
  }

  getCircuitBreakerStatus() {
    const status = {};

    for (const [opType, breaker] of this.circuitBreakers.entries()) {
      status[opType] = {
        state: breaker.state,
        failureCount: breaker.failureCount,
        successCount: breaker.successCount,
        totalRequests: breaker.totalRequests,
        failureRate: breaker.totalRequests > 0 ? 
          (breaker.failureCount / breaker.totalRequests) * 100 : 0
      };
    }

    return status;
  }

  // Recovery assistance methods
  async waitForNetworkRecovery() {
    const maxWaitTime = 30000; // 30 seconds
    const checkInterval = 1000;  // 1 second
    let waited = 0;

    while (waited < maxWaitTime) {
      try {
        const connected = await this.checkConnectivity();
        if (connected) {
          return true;
        }
      } catch (error) {
        // Continue waiting
      }

      await this.sleep(checkInterval);
      waited += checkInterval;
    }

    return false;
  }

  async waitForResourceAvailability() {
    const maxWaitTime = 60000; // 60 seconds
    const checkInterval = 5000;  // 5 seconds
    let waited = 0;

    while (waited < maxWaitTime) {
      try {
        const resourceStatus = await this.checkResourceStatus();
        if (resourceStatus.healthy) {
          return true;
        }
      } catch (error) {
        // Continue waiting
      }

      await this.sleep(checkInterval);
      waited += checkInterval;
    }

    return false;
  }

  async waitForReplicaSetRecovery() {
    const maxWaitTime = 120000; // 2 minutes
    const checkInterval = 10000;  // 10 seconds
    let waited = 0;

    while (waited < maxWaitTime) {
      try {
        const replicaStatus = await this.checkReplicaSetHealth();
        if (replicaStatus.healthy) {
          return true;
        }
      } catch (error) {
        // Continue waiting
      }

      await this.sleep(checkInterval);
      waited += checkInterval;
    }

    return false;
  }

  async performHealthCheck() {
    const health = await this.performComprehensiveHealthCheck();
    return health.overall === 'healthy' || health.overall === 'degraded';
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  // Get comprehensive resilience report
  getResilienceReport() {
    return {
      timestamp: new Date(),
      circuitBreakers: this.getCircuitBreakerStatus(),
      transactionHealth: this.transactionManager.getTransactionHealthReport(),
      lastHealthCheck: this.lastHealthStatus,
      recoveryStrategies: Array.from(this.recoveryStrategies.keys()),
      configuration: {
        circuitBreakerThreshold: this.options.circuitBreakerThreshold,
        circuitBreakerTimeout: this.options.circuitBreakerTimeout,
        healthCheckInterval: this.options.healthCheckInterval,
        automaticRecoveryEnabled: this.options.enableAutomaticRecovery
      }
    };
  }
}
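
A usage sketch for the resilience manager follows. It reuses the MongoClient import and the classes defined above; the connection string, database, collection, and document values are placeholders, and the callback simply shows how the session provided by the manager is threaded through each operation.

// Hypothetical usage of MongoResilienceManager; URI, database, and collection
// names are placeholders for your own deployment.
async function runResilientOrderConfirmation() {
  const client = new MongoClient('mongodb://localhost:27017/?replicaSet=rs0');
  await client.connect();

  const resilience = new MongoResilienceManager(client, {
    circuitBreakerThreshold: 5,
    circuitBreakerTimeout: 60000,
    enableAutomaticRecovery: true
  });

  const outcome = await resilience.executeResilientTransaction(async (session) => {
    const orders = client.db('ecommerce').collection('orders');

    // Every operation inside the callback must pass the session explicitly
    return await orders.updateOne(
      { orderId: 'order_123', status: 'pending' },
      { $set: { status: 'confirmed', confirmedAt: new Date() } },
      { session }
    );
  });

  console.log('Transaction succeeded:', outcome.success);
  console.log('Resilience report:', JSON.stringify(resilience.getResilienceReport(), null, 2));

  // The manager's health-check interval keeps the Node.js event loop alive,
  // so a short-lived script should exit explicitly after closing the client.
  await client.close();
}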

SQL-Style Error Handling with QueryLeaf

QueryLeaf provides familiar approaches to MongoDB transaction error handling and monitoring:

-- QueryLeaf transaction error handling with SQL-familiar syntax

-- Monitor transaction error patterns
SELECT 
  DATE_TRUNC('hour', error_timestamp) as hour_bucket,
  error_category,
  error_code,

  -- Error statistics
  COUNT(*) as error_count,
  COUNT(DISTINCT transaction_id) as affected_transactions,
  AVG(retry_attempts) as avg_retry_attempts,
  COUNT(CASE WHEN recovery_successful = true THEN 1 END) as successful_recoveries,

  -- Performance impact
  AVG(transaction_duration_ms) as avg_failed_transaction_duration,
  AVG(time_to_failure_ms) as avg_time_to_failure,

  -- Recovery metrics
  AVG(recovery_time_ms) as avg_recovery_time,
  MAX(recovery_time_ms) as max_recovery_time,

  -- Success rates
  ROUND((COUNT(CASE WHEN recovery_successful = true THEN 1 END)::DECIMAL / COUNT(*)) * 100, 2) as recovery_success_rate

FROM TRANSACTION_ERROR_LOG()
WHERE error_timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', error_timestamp), error_category, error_code
ORDER BY hour_bucket DESC, error_count DESC;

-- Analyze transaction resilience patterns
WITH transaction_resilience AS (
  SELECT 
    transaction_id,
    transaction_type,

    -- Transaction characteristics
    operation_count,
    total_duration_ms,
    retry_attempts,

    -- Error analysis
    first_error_code,
    first_error_category,
    total_errors,

    -- Recovery analysis
    recovery_strategy_used,
    recovery_successful,
    recovery_duration_ms,

    -- Final outcome
    final_status, -- committed, failed, recovered

    -- Timing analysis
    created_at,
    completed_at

  FROM TRANSACTION_HISTORY()
  WHERE created_at >= NOW() - INTERVAL '7 days'
),

resilience_patterns AS (
  SELECT 
    transaction_type,
    first_error_category,

    -- Volume metrics
    COUNT(*) as transaction_count,
    COUNT(CASE WHEN final_status = 'committed' THEN 1 END) as successful_transactions,
    COUNT(CASE WHEN final_status = 'recovered' THEN 1 END) as recovered_transactions,
    COUNT(CASE WHEN final_status = 'failed' THEN 1 END) as failed_transactions,

    -- Retry analysis
    AVG(retry_attempts) as avg_retry_attempts,
    MAX(retry_attempts) as max_retry_attempts,
    COUNT(CASE WHEN retry_attempts > 0 THEN 1 END) as transactions_with_retries,

    -- Recovery analysis
    COUNT(CASE WHEN recovery_strategy_used IS NOT NULL THEN 1 END) as recovery_attempts,
    COUNT(CASE WHEN recovery_successful = true THEN 1 END) as successful_recoveries,
    AVG(recovery_duration_ms) as avg_recovery_duration,

    -- Performance metrics
    AVG(total_duration_ms) as avg_total_duration,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_duration_ms) as p95_duration,

    -- Success rates
    ROUND((COUNT(CASE WHEN final_status IN ('committed', 'recovered') THEN 1 END)::DECIMAL / COUNT(*)) * 100, 2) as overall_success_rate,
    ROUND((COUNT(CASE WHEN recovery_successful = true THEN 1 END)::DECIMAL / 
           GREATEST(COUNT(CASE WHEN recovery_strategy_used IS NOT NULL THEN 1 END), 1)) * 100, 2) as recovery_success_rate

  FROM transaction_resilience
  GROUP BY transaction_type, first_error_category
)

SELECT 
  transaction_type,
  first_error_category,

  -- Volume and success metrics
  transaction_count,
  successful_transactions,
  recovered_transactions,
  failed_transactions,
  overall_success_rate,

  -- Retry patterns
  avg_retry_attempts,
  max_retry_attempts,
  ROUND((transactions_with_retries::DECIMAL / transaction_count) * 100, 2) as retry_rate_percent,

  -- Recovery effectiveness
  recovery_attempts,
  successful_recoveries,
  recovery_success_rate,
  avg_recovery_duration,

  -- Performance characteristics
  avg_total_duration,
  p95_duration,

  -- Health assessment
  CASE 
    WHEN overall_success_rate >= 99 THEN 'Excellent'
    WHEN overall_success_rate >= 95 THEN 'Good' 
    WHEN overall_success_rate >= 90 THEN 'Fair'
    ELSE 'Poor'
  END as resilience_grade,

  -- Recommendations
  CASE 
    WHEN recovery_success_rate < 50 AND recovery_attempts > 0 THEN 'Improve recovery strategies'
    WHEN avg_retry_attempts > 3 THEN 'Review retry configuration'
    WHEN failed_transactions > successful_transactions * 0.1 THEN 'Investigate error root causes'
    ELSE 'Performance acceptable'
  END as recommendation

FROM resilience_patterns
ORDER BY transaction_count DESC, overall_success_rate ASC;

-- Real-time transaction health monitoring
SELECT 
  -- Current status
  COUNT(CASE WHEN status = 'active' THEN 1 END) as active_transactions,
  COUNT(CASE WHEN status = 'retrying' THEN 1 END) as retrying_transactions,
  COUNT(CASE WHEN status = 'recovering' THEN 1 END) as recovering_transactions,
  COUNT(CASE WHEN status = 'failed' THEN 1 END) as failed_transactions,

  -- Recent performance (last 5 minutes)
  AVG(CASE WHEN completed_at >= NOW() - INTERVAL '5 minutes' 
           THEN duration_ms END) as recent_avg_duration_ms,
  COUNT(CASE WHEN completed_at >= NOW() - INTERVAL '5 minutes' 
             AND final_status = 'committed' THEN 1 END) as recent_successful_transactions,
  COUNT(CASE WHEN completed_at >= NOW() - INTERVAL '5 minutes' 
             AND final_status = 'failed' THEN 1 END) as recent_failed_transactions,

  -- Error rates
  ROUND((COUNT(CASE WHEN error_occurred_at >= NOW() - INTERVAL '5 minutes' THEN 1 END)::DECIMAL /
         GREATEST(COUNT(CASE WHEN created_at >= NOW() - INTERVAL '5 minutes' THEN 1 END), 1)) * 100, 2) 
         as recent_error_rate_percent,

  -- Circuit breaker status
  COUNT(CASE WHEN circuit_breaker_state = 'open' THEN 1 END) as open_circuit_breakers,
  COUNT(CASE WHEN circuit_breaker_state = 'half-open' THEN 1 END) as half_open_circuit_breakers,

  -- Recovery metrics
  COUNT(CASE WHEN recovery_in_progress = true THEN 1 END) as active_recoveries,
  AVG(CASE WHEN recovery_completed_at >= NOW() - INTERVAL '5 minutes' 
           THEN recovery_duration_ms END) as recent_avg_recovery_time_ms,

  -- Health indicators
  CASE 
    WHEN COUNT(CASE WHEN status = 'failed' THEN 1 END) > 
         COUNT(CASE WHEN status = 'active' THEN 1 END) * 0.5 THEN 'Critical'
    WHEN COUNT(CASE WHEN circuit_breaker_state = 'open' THEN 1 END) > 0 THEN 'Degraded'
    WHEN COUNT(CASE WHEN status = 'retrying' THEN 1 END) > 
         COUNT(CASE WHEN status = 'active' THEN 1 END) * 0.3 THEN 'Warning'
    ELSE 'Healthy'
  END as overall_health_status,

  NOW() as report_timestamp

FROM ACTIVE_TRANSACTION_STATUS()
CROSS JOIN CIRCUIT_BREAKER_STATUS()
CROSS JOIN RECOVERY_STATUS();

-- Transaction error prevention and optimization
CREATE ALERT TRANSACTION_ERROR_PREVENTION
ON TRANSACTION_ERROR_LOG()
WHEN (
  -- High error rate
  (SELECT COUNT(*) FROM TRANSACTION_ERROR_LOG() 
   WHERE error_timestamp >= NOW() - INTERVAL '5 minutes') > 10
  OR
  -- Circuit breaker opened
  (SELECT COUNT(*) FROM CIRCUIT_BREAKER_STATUS() 
   WHERE state = 'open') > 0
  OR
  -- Recovery failing
  (SELECT AVG(CASE WHEN recovery_successful = true THEN 1.0 ELSE 0.0 END) 
   FROM TRANSACTION_ERROR_LOG() 
   WHERE error_timestamp >= NOW() - INTERVAL '15 minutes' 
   AND recovery_strategy_used IS NOT NULL) < 0.5
)
NOTIFY ['dba-team@company.com', 'dev-team@company.com']
WITH MESSAGE TEMPLATE '''
Transaction Error Alert

Current Status:
- Recent Errors (5 min): {{ recent_error_count }}
- Open Circuit Breakers: {{ open_circuit_breaker_count }}
- Active Recoveries: {{ active_recovery_count }}
- Recovery Success Rate: {{ recovery_success_rate }}%

Top Error Categories:
{{ top_error_categories }}

Recommended Actions:
{{ error_prevention_recommendations }}

Dashboard: https://monitoring.company.com/mongodb/transactions
'''
EVERY 1 MINUTE;

-- QueryLeaf transaction error handling provides:
-- 1. SQL-familiar error monitoring and analysis
-- 2. Comprehensive transaction resilience reporting
-- 3. Real-time health monitoring and alerting
-- 4. Intelligent error pattern detection
-- 5. Recovery strategy effectiveness analysis
-- 6. Circuit breaker status monitoring
-- 7. Performance impact assessment
-- 8. Automated prevention and optimization recommendations
-- 9. Integration with MongoDB's native error handling
-- 10. Production-ready operational visibility

Best Practices for MongoDB Transaction Error Handling

Error Classification Strategy

Optimal error handling configuration differs by application pattern (a configuration sketch follows this list):

  1. High-Frequency Applications: Aggressive retry policies with intelligent backoff
  2. Mission-Critical Systems: Comprehensive recovery strategies with circuit breakers
  3. Batch Processing: Extended timeout configurations with resource monitoring
  4. Real-time Applications: Fast-fail approaches with immediate fallback mechanisms
  5. Microservices: Distributed error handling with service-level circuit breakers
  6. Analytics Workloads: Specialized error handling for long-running operations
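
As a rough illustration of how these patterns translate into concrete settings, the sketch below defines hypothetical per-pattern presets for an error handler like the one developed above. The preset names, retry counts, backoff values, and circuit breaker thresholds are assumptions for illustration only; the nested transactionOptions fields (readConcern, writeConcern, maxCommitTimeMS) are standard MongoDB transaction options.

// Illustrative error-handling presets per application pattern (values are assumptions, not benchmarks)
const errorHandlingPresets = {
  highFrequency: {
    maxRetries: 5,                // aggressive retries for short, idempotent transactions
    baseBackoffMs: 25,            // small initial backoff, grows exponentially
    circuitBreakerThreshold: 20,  // trip only after sustained failure
    transactionOptions: { readConcern: { level: 'local' }, writeConcern: { w: 'majority' } }
  },
  missionCritical: {
    maxRetries: 3,
    baseBackoffMs: 100,
    circuitBreakerThreshold: 5,   // fail fast so operators see problems early
    transactionOptions: { readConcern: { level: 'majority' }, writeConcern: { w: 'majority', j: true } }
  },
  batchProcessing: {
    maxRetries: 3,
    baseBackoffMs: 1000,          // long backoff; completion matters more than latency
    transactionOptions: {
      readConcern: { level: 'snapshot' },
      writeConcern: { w: 'majority' },
      maxCommitTimeMS: 60000      // extended commit window for large batches
    }
  },
  realTime: {
    maxRetries: 1,                // fast-fail and fall back to a degraded response
    baseBackoffMs: 10,
    circuitBreakerThreshold: 3,
    transactionOptions: { readConcern: { level: 'local' }, writeConcern: { w: 1 } }
  }
};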

Recovery Strategy Guidelines

Essential patterns for production transaction recovery (a minimal retry sketch follows this list):

  1. Automatic Retry Logic: Exponential backoff with jitter for transient failures
  2. Circuit Breaker Pattern: Prevent cascading failures with intelligent state management
  3. Health Monitoring: Continuous assessment of system and transaction health
  4. Recovery Automation: Context-aware recovery strategies for different error types
  5. Performance Monitoring: Track error impact on application performance
  6. Operational Alerting: Proactive notification of error patterns and recovery issues
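
The first two guidelines can be combined in a small wrapper like the sketch below, which retries a transaction callback with exponential backoff and full jitter whenever the driver reports a TransientTransactionError label. The function name, retry limits, and delay constants are assumptions for illustration; this is a minimal sketch, not the full error handler developed above.

// Minimal retry sketch: exponential backoff with jitter for transient transaction errors
async function runTransactionWithRetry(client, txnFn, { maxRetries = 5, baseDelayMs = 50 } = {}) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const session = client.startSession();
    try {
      session.startTransaction({
        readConcern: { level: 'majority' },
        writeConcern: { w: 'majority' }
      });
      const result = await txnFn(session);
      await session.commitTransaction();
      return result;
    } catch (error) {
      await session.abortTransaction().catch(() => {}); // best-effort cleanup
      const transient = typeof error.hasErrorLabel === 'function' &&
        error.hasErrorLabel('TransientTransactionError');
      // UnknownTransactionCommitResult should retry only the commit, not the whole
      // transaction body; that case is omitted here for brevity.
      if (!transient || attempt === maxRetries) throw error;

      // Exponential backoff with full jitter to avoid synchronized retry storms
      const delayMs = Math.random() * baseDelayMs * Math.pow(2, attempt);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    } finally {
      await session.endSession();
    }
  }
}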

Conclusion

MongoDB transaction error handling and recovery requires sophisticated strategies that balance reliability, performance, and operational complexity. By implementing intelligent retry mechanisms, comprehensive error classification, and automated recovery patterns, applications can maintain consistency and reliability even when facing distributed system challenges.

Key error handling benefits include:

  • Intelligent Recovery: Automatic retry logic with context-aware recovery strategies
  • Comprehensive Monitoring: Detailed error tracking and performance analysis
  • Circuit Breaker Protection: Prevention of cascading failures with intelligent state management
  • Health Assessment: Continuous monitoring of transaction and system health
  • Operational Visibility: Real-time insights into error patterns and recovery effectiveness
  • Production Resilience: Enterprise-grade reliability patterns for mission-critical applications

Whether you're building high-throughput web applications, distributed microservices, data processing pipelines, or real-time analytics platforms, MongoDB's intelligent transaction error handling with QueryLeaf's familiar management interface provides the foundation for resilient, reliable database operations. This combination enables you to leverage advanced error recovery capabilities while maintaining familiar database administration patterns and operational procedures.

QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar error handling patterns into optimal MongoDB transaction configurations while providing comprehensive monitoring and recovery through SQL-style queries. Advanced error classification, recovery automation, and performance analysis are seamlessly managed through familiar database administration interfaces, making sophisticated error handling both powerful and accessible.

The integration of intelligent error handling with SQL-style database operations makes MongoDB an ideal platform for applications requiring both high reliability and familiar error management patterns, ensuring your transactions remain both consistent and resilient as they scale to meet demanding production requirements.

MongoDB Aggregation Framework for Real-Time Analytics: Advanced Data Processing Pipelines and SQL-Compatible Query Patterns

Modern applications require sophisticated data processing capabilities that can handle complex analytical queries, real-time aggregations, and advanced transformations at scale. Traditional approaches to data analytics often rely on separate ETL processes, batch processing systems, and complex data warehouses that introduce latency, complexity, and operational overhead, all of which become increasingly problematic as data volumes and processing demands grow.

MongoDB's Aggregation Framework provides powerful in-database processing capabilities that enable real-time analytics, complex data transformations, and sophisticated analytical queries directly within the operational database. Unlike traditional batch-oriented analytics approaches, MongoDB aggregation pipelines process data in real-time, support complex multi-stage transformations, and integrate seamlessly with operational workloads while delivering high-performance analytical results.
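
To make the contrast concrete before diving into the comparison, here is a deliberately small pipeline sketch showing the basic shape of in-database aggregation: match, group, and sort stages executed directly against an operational collection. The collection and field names (orders, status, totals.total) are assumptions that mirror the schema used in the larger examples later in this article, and $dateTrunc requires MongoDB 5.0 or newer.

// Minimal aggregation pipeline sketch: daily revenue for recent completed orders (assumed schema)
const dailyRevenue = await db.collection('orders').aggregate([
  { $match: {
      status: 'completed',
      orderDate: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) } // last 30 days
  }},
  { $group: {
      _id: { $dateTrunc: { date: '$orderDate', unit: 'day' } },
      orderCount: { $sum: 1 },
      totalRevenue: { $sum: '$totals.total' },
      avgOrderValue: { $avg: '$totals.total' }
  }},
  { $sort: { _id: -1 } }
]).toArray();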

The Traditional Data Analytics Limitations

Conventional relational database analytics approaches have significant constraints for modern real-time processing requirements:

-- Traditional PostgreSQL analytics - limited window functions and complex subqueries

-- Basic sales analytics with traditional SQL limitations
WITH monthly_sales_summary AS (
  SELECT 
    DATE_TRUNC('month', order_date) as month,
    product_category,
    customer_id,
    salesperson_id,
    region,

    -- Basic aggregations
    COUNT(*) as order_count,
    SUM(total_amount) as total_revenue,
    AVG(total_amount) as avg_order_value,
    MIN(total_amount) as min_order_value,
    MAX(total_amount) as max_order_value,

    -- Limited window function capabilities
    SUM(SUM(total_amount)) OVER (
      PARTITION BY product_category, region 
      ORDER BY DATE_TRUNC('month', order_date)
      RANGE BETWEEN INTERVAL '3 months' PRECEDING AND CURRENT ROW
    ) as rolling_3_month_revenue,

    LAG(SUM(total_amount)) OVER (
      PARTITION BY product_category, region 
      ORDER BY DATE_TRUNC('month', order_date)
    ) as previous_month_revenue,

    -- Row number for ranking (limited functionality)
    ROW_NUMBER() OVER (
      PARTITION BY DATE_TRUNC('month', order_date), region
      ORDER BY SUM(total_amount) DESC
    ) as revenue_rank_in_region

  FROM orders o
  LEFT JOIN order_items oi ON o.order_id = oi.order_id
  LEFT JOIN products p ON oi.product_id = p.product_id
  LEFT JOIN customers c ON o.customer_id = c.customer_id
  LEFT JOIN salespeople s ON o.salesperson_id = s.salesperson_id
  WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
    AND o.status = 'completed'
  GROUP BY 
    DATE_TRUNC('month', order_date), 
    product_category, 
    customer_id, 
    salesperson_id, 
    region
),

customer_segmentation AS (
  SELECT 
    customer_id,
    region,

    -- Customer metrics calculation
    COUNT(*) as total_orders,
    SUM(total_revenue) as lifetime_revenue,
    AVG(avg_order_value) as avg_order_value,
    MAX(month) as last_order_month,
    MIN(month) as first_order_month,

    -- Recency, Frequency, Monetary calculation (limited)
    EXTRACT(DAY FROM (CURRENT_DATE - MAX(month))) as days_since_last_order,
    COUNT(*) as frequency_score,
    SUM(total_revenue) as monetary_score,

    -- Simple percentile calculation (limited support)
    PERCENT_RANK() OVER (ORDER BY SUM(total_revenue)) as revenue_percentile,
    PERCENT_RANK() OVER (ORDER BY COUNT(*)) as frequency_percentile,

    -- Basic customer categorization
    CASE 
      WHEN SUM(total_revenue) > 10000 AND COUNT(*) > 10 THEN 'high_value'
      WHEN SUM(total_revenue) > 5000 OR COUNT(*) > 5 THEN 'medium_value'
      WHEN EXTRACT(DAY FROM (CURRENT_DATE - MAX(month))) > 90 THEN 'at_risk'
      ELSE 'low_value'
    END as customer_segment,

    -- Growth trend analysis (very limited)
    CASE 
      WHEN COUNT(*) FILTER (WHERE month >= CURRENT_DATE - INTERVAL '3 months') > 0 THEN 'active'
      WHEN COUNT(*) FILTER (WHERE month >= CURRENT_DATE - INTERVAL '6 months') > 0 THEN 'declining'
      ELSE 'inactive'
    END as activity_trend

  FROM monthly_sales_summary
  GROUP BY customer_id, region
),

product_performance AS (
  SELECT 
    product_category,
    region,
    month,

    -- Product metrics
    SUM(order_count) as total_orders,
    SUM(total_revenue) as category_revenue,
    AVG(avg_order_value) as avg_category_order_value,
    COUNT(DISTINCT customer_id) as unique_customers,

    -- Market share calculation (complex with traditional SQL)
    SUM(total_revenue) / (
      SELECT SUM(total_revenue) 
      FROM monthly_sales_summary mss2 
      WHERE mss2.month = monthly_sales_summary.month 
        AND mss2.region = monthly_sales_summary.region
    ) * 100 as market_share_percent,

    -- Growth rate calculation
    SUM(total_revenue) / NULLIF(LAG(SUM(total_revenue)) OVER (
      PARTITION BY product_category, region 
      ORDER BY month
    ), 0) - 1 as month_over_month_growth,

    -- Seasonal analysis (limited capabilities)
    AVG(SUM(total_revenue)) OVER (
      PARTITION BY product_category, region, EXTRACT(MONTH FROM month)
      ORDER BY month
      ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) as seasonal_avg_revenue

  FROM monthly_sales_summary
  GROUP BY product_category, region, month
),

advanced_analytics AS (
  SELECT 
    cs.customer_segment,
    cs.region,
    cs.activity_trend,

    -- Customer segment analysis
    COUNT(*) as customers_in_segment,
    AVG(cs.lifetime_revenue) as avg_lifetime_value,
    AVG(cs.total_orders) as avg_orders_per_customer,
    AVG(cs.days_since_last_order) as avg_days_since_last_order,

    -- Revenue contribution by segment
    SUM(cs.lifetime_revenue) as segment_total_revenue,
    SUM(cs.lifetime_revenue) / (
      SELECT SUM(lifetime_revenue) FROM customer_segmentation
    ) * 100 as revenue_contribution_percent,

    -- Top products for each segment (limited subquery approach)
    (
      SELECT product_category 
      FROM monthly_sales_summary mss
      WHERE mss.customer_id IN (
        SELECT cs2.customer_id 
        FROM customer_segmentation cs2 
        WHERE cs2.customer_segment = cs.customer_segment
          AND cs2.region = cs.region
      )
      GROUP BY product_category
      ORDER BY SUM(total_revenue) DESC
      LIMIT 1
    ) as top_product_category,

    -- Cohort analysis (very complex with traditional SQL)
    COUNT(*) FILTER (
      WHERE cs.first_order_month >= CURRENT_DATE - INTERVAL '1 month'
    ) as new_customers_this_month,

    COUNT(*) FILTER (
      WHERE cs.last_order_month >= CURRENT_DATE - INTERVAL '1 month'
        AND cs.first_order_month < CURRENT_DATE - INTERVAL '1 month'
    ) as returning_customers_this_month

  FROM customer_segmentation cs
  GROUP BY cs.customer_segment, cs.region, cs.activity_trend
)

SELECT 
  customer_segment,
  region,
  activity_trend,
  customers_in_segment,
  ROUND(avg_lifetime_value::numeric, 2) as avg_lifetime_value,
  ROUND(avg_orders_per_customer::numeric, 2) as avg_orders_per_customer,
  ROUND(avg_days_since_last_order::numeric, 1) as avg_days_since_last_order,
  ROUND(segment_total_revenue::numeric, 2) as segment_revenue,
  ROUND(revenue_contribution_percent::numeric, 2) as revenue_contribution_pct,
  top_product_category,
  new_customers_this_month,
  returning_customers_this_month,

  -- Customer health score (simplified)
  CASE 
    WHEN customer_segment = 'high_value' AND activity_trend = 'active' THEN 95
    WHEN customer_segment = 'high_value' AND activity_trend = 'declining' THEN 70
    WHEN customer_segment = 'medium_value' AND activity_trend = 'active' THEN 80
    WHEN customer_segment = 'medium_value' AND activity_trend = 'declining' THEN 55
    WHEN customer_segment = 'low_value' AND activity_trend = 'active' THEN 65
    WHEN activity_trend = 'inactive' THEN 25
    ELSE 40
  END as customer_health_score,

  -- Recommendations (limited business logic)
  CASE 
    WHEN customer_segment = 'high_value' AND activity_trend = 'declining' THEN 'Urgent: Re-engagement campaign needed'
    WHEN customer_segment = 'medium_value' AND activity_trend = 'active' THEN 'Opportunity: Upsell to premium products'
    WHEN customer_segment = 'at_risk' THEN 'Action: Retention campaign required'
    WHEN new_customers_this_month > returning_customers_this_month THEN 'Focus: Improve customer retention'
    ELSE 'Monitor: Continue current strategy'
  END as recommended_action

FROM advanced_analytics
ORDER BY 
  CASE customer_segment 
    WHEN 'high_value' THEN 1 
    WHEN 'medium_value' THEN 2 
    WHEN 'low_value' THEN 3 
    ELSE 4 
  END,
  segment_revenue DESC;

-- Traditional PostgreSQL analytics problems:
-- 1. Complex multi-table JOINs required for comprehensive analysis
-- 2. Limited window function capabilities for advanced analytics
-- 3. Difficult to implement complex transformations and nested aggregations
-- 4. Poor performance with large datasets and complex calculations
-- 5. Limited support for hierarchical and nested data structures
-- 6. No built-in support for time-series analytics and forecasting
-- 7. Complex subqueries required for conditional aggregations
-- 8. Difficult to implement real-time analytics and streaming calculations
-- 9. Limited flexibility for dynamic grouping and pivot operations
-- 10. No native support for advanced statistical functions and machine learning

-- MySQL limitations are even more severe
SELECT 
  DATE_FORMAT(order_date, '%Y-%m') as month,
  product_category,
  region,
  COUNT(*) as order_count,
  SUM(total_amount) as revenue,
  AVG(total_amount) as avg_order_value,

  -- Very limited analytical capabilities
  -- No window functions in older MySQL versions
  -- No complex aggregation support
  -- Limited JSON processing capabilities
  -- Poor performance with complex queries

  (SELECT SUM(total_amount) 
   FROM orders o2 
   WHERE DATE_FORMAT(o2.order_date, '%Y-%m') = DATE_FORMAT(orders.order_date, '%Y-%m')
     AND o2.region = orders.region) as region_monthly_total

FROM orders
JOIN order_items ON orders.order_id = order_items.order_id
JOIN products ON order_items.product_id = products.product_id
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
  AND status = 'completed'
GROUP BY 
  DATE_FORMAT(order_date, '%Y-%m'), 
  product_category, 
  region
ORDER BY month DESC, revenue DESC;

-- MySQL problems:
-- - No window functions in older versions
-- - Very limited JSON support and processing
-- - Basic aggregation functions only
-- - Poor performance with complex analytical queries
-- - No support for advanced statistical calculations
-- - Limited date/time processing capabilities
-- - No native support for real-time analytics
-- - Basic subquery support with performance issues

MongoDB's Aggregation Framework provides comprehensive real-time analytics capabilities:

// MongoDB Advanced Aggregation Framework - powerful real-time analytics and data processing
const { MongoClient } = require('mongodb');

class MongoDBAnalyticsEngine {
  constructor(db) {
    this.db = db;
    this.collections = {
      orders: db.collection('orders'),
      products: db.collection('products'),
      customers: db.collection('customers'),
      analytics: db.collection('analytics_cache')
    };
    this.pipelineCache = new Map();
  }

  async performComprehensiveAnalytics() {
    console.log('Executing comprehensive real-time analytics with MongoDB Aggregation Framework...');

    // Execute multiple analytical pipelines in parallel
    const [
      salesAnalytics,
      customerSegmentation,
      productPerformance,
      timeSeriesAnalysis,
      predictiveInsights
    ] = await Promise.all([
      this.executeSalesAnalyticsPipeline(),
      this.executeCustomerSegmentationPipeline(),
      this.executeProductPerformancePipeline(),
      this.executeTimeSeriesAnalytics(),
      this.executePredictiveAnalytics()
    ]);

    return {
      salesAnalytics,
      customerSegmentation,
      productPerformance,
      timeSeriesAnalysis,
      predictiveInsights,
      generatedAt: new Date()
    };
  }

  async executeSalesAnalyticsPipeline() {
    console.log('Executing advanced sales analytics pipeline...');

    const pipeline = [
      // Stage 1: Match recent completed orders
      {
        $match: {
          status: 'completed',
          orderDate: { $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) }, // Last 12 months
          'totals.total': { $gt: 0 }
        }
      },

      // Stage 2: Add computed fields and date transformations
      {
        $addFields: {
          year: { $year: '$orderDate' },
          month: { $month: '$orderDate' },
          dayOfYear: { $dayOfYear: '$orderDate' },
          weekOfYear: { $week: '$orderDate' },
          quarter: { 
            $ceil: { $divide: [{ $month: '$orderDate' }, 3] }
          },

          // Calculate order metrics
          orderValue: '$totals.total',
          itemCount: { $size: '$items' },
          avgItemValue: { 
            $divide: ['$totals.total', { $size: '$items' }] 
          },

          // Customer type classification
          customerType: {
            $switch: {
              branches: [
                { case: { $gte: ['$totals.total', 1000] }, then: 'high_value' },
                { case: { $gte: ['$totals.total', 500] }, then: 'medium_value' },
                { case: { $gte: ['$totals.total', 100] }, then: 'regular' }
              ],
              default: 'low_value'
            }
          },

          // Season classification
          season: {
            $switch: {
              branches: [
                { case: { $in: [{ $month: '$orderDate' }, [12, 1, 2]] }, then: 'winter' },
                { case: { $in: [{ $month: '$orderDate' }, [3, 4, 5]] }, then: 'spring' },
                { case: { $in: [{ $month: '$orderDate' }, [6, 7, 8]] }, then: 'summer' },
                { case: { $in: [{ $month: '$orderDate' }, [9, 10, 11]] }, then: 'fall' }
              ],
              default: 'unknown'
            }
          }
        }
      },

      // Stage 3: Lookup customer information
      {
        $lookup: {
          from: 'customers',
          localField: 'customerId',
          foreignField: '_id',
          as: 'customer',
          pipeline: [
            {
              $project: {
                name: 1,
                email: 1,
                'profile.location.country': 1,
                'profile.location.region': 1,
                'account.type': 1,
                'account.registrationDate': 1,
                'preferences.category': 1
              }
            }
          ]
        }
      },

      // Stage 4: Unwind customer data
      { $unwind: '$customer' },

      // Stage 5: Unwind order items for detailed analysis
      { $unwind: '$items' },

      // Stage 6: Lookup product information
      {
        $lookup: {
          from: 'products',
          localField: 'items.productId',
          foreignField: '_id',
          as: 'product',
          pipeline: [
            {
              $project: {
                name: 1,
                category: 1,
                brand: 1,
                'pricing.cost': 1,
                'specifications.weight': 1,
                'inventory.supplier': 1
              }
            }
          ]
        }
      },

      // Stage 7: Unwind product data
      { $unwind: '$product' },

      // Stage 8: Calculate item-level metrics
      {
        $addFields: {
          itemRevenue: { $multiply: ['$items.quantity', '$items.unitPrice'] },
          itemProfit: { 
            $multiply: [
              '$items.quantity', 
              { $subtract: ['$items.unitPrice', '$product.pricing.cost'] }
            ]
          },
          profitMargin: {
            $divide: [
              { $subtract: ['$items.unitPrice', '$product.pricing.cost'] },
              '$items.unitPrice'
            ]
          }
        }
      },

      // Stage 9: Group by multiple dimensions for comprehensive analysis
      {
        $group: {
          _id: {
            year: '$year',
            month: '$month',
            quarter: '$quarter',
            season: '$season',
            category: '$product.category',
            brand: '$product.brand',
            country: '$customer.profile.location.country',
            region: '$customer.profile.location.region',
            customerType: '$customerType',
            accountType: '$customer.account.type'
          },

          // Order-level metrics
          totalOrders: { $sum: 1 },
          uniqueCustomers: { $addToSet: '$customerId' },
          totalRevenue: { $sum: '$itemRevenue' },
          totalProfit: { $sum: '$itemProfit' },
          totalQuantity: { $sum: '$items.quantity' },

          // Statistical measures
          avgOrderValue: { $avg: '$orderValue' },
          minOrderValue: { $min: '$orderValue' },
          maxOrderValue: { $max: '$orderValue' },
          stdDevOrderValue: { $stdDevPop: '$orderValue' },

          // Product performance
          avgProfitMargin: { $avg: '$profitMargin' },
          avgItemPrice: { $avg: '$items.unitPrice' },
          totalWeight: { $sum: { $multiply: ['$items.quantity', '$product.specifications.weight'] } },

          // Customer insights
          newCustomers: {
            $sum: {
              $cond: [
                { $gte: [
                  '$customer.account.registrationDate',
                  { $dateFromParts: { year: '$year', month: '$month', day: 1 } }
                ]},
                1, 0
              ]
            }
          },

          // Supplier diversity
          uniqueSuppliers: { $addToSet: '$product.inventory.supplier' },

          // Sample orders for detailed analysis
          sampleOrders: { $push: {
            orderId: '$_id',
            customerId: '$customerId',
            orderValue: '$orderValue',
            itemCount: '$itemCount',
            orderDate: '$orderDate'
          }}
        }
      },

      // Stage 10: Calculate derived metrics
      {
        $addFields: {
          uniqueCustomerCount: { $size: '$uniqueCustomers' },
          uniqueSupplierCount: { $size: '$uniqueSuppliers' },
          averageOrdersPerCustomer: { 
            $divide: ['$totalOrders', { $size: '$uniqueCustomers' }] 
          },
          revenuePerCustomer: { 
            $divide: ['$totalRevenue', { $size: '$uniqueCustomers' }] 
          },
          profitMarginPercent: { 
            $multiply: [{ $divide: ['$totalProfit', '$totalRevenue'] }, 100] 
          },
          customerAcquisitionRate: {
            $divide: ['$newCustomers', { $size: '$uniqueCustomers' }]
          }
        }
      },

      // Stage 11: Add ranking and percentile information
      {
        $setWindowFields: {
          partitionBy: { year: '$_id.year', quarter: '$_id.quarter' },
          sortBy: { totalRevenue: -1 },
          output: {
            revenueRank: { $rank: {} },
            revenuePercentile: { $percentRank: {} },
            cumulativeRevenue: { $sum: '$totalRevenue', window: { documents: ['unbounded', 'current'] } },
            movingAvgRevenue: { $avg: '$totalRevenue', window: { documents: [-2, 2] } }
          }
        }
      },

      // Stage 12: Calculate growth rates using window functions
      {
        $setWindowFields: {
          partitionBy: { 
            category: '$_id.category', 
            country: '$_id.country' 
          },
          sortBy: { year: 1, month: 1 },
          output: {
            previousMonthRevenue: { 
              $shift: { output: '$totalRevenue', by: -1 } 
            },
            previousYearRevenue: { 
              $shift: { output: '$totalRevenue', by: -12 } 
            }
          }
        }
      },

      // Stage 13: Calculate final growth metrics
      {
        $addFields: {
          monthOverMonthGrowth: {
            $cond: [
              { $gt: ['$previousMonthRevenue', 0] },
              { 
                $subtract: [
                  { $divide: ['$totalRevenue', '$previousMonthRevenue'] },
                  1
                ]
              },
              null
            ]
          },
          yearOverYearGrowth: {
            $cond: [
              { $gt: ['$previousYearRevenue', 0] },
              { 
                $subtract: [
                  { $divide: ['$totalRevenue', '$previousYearRevenue'] },
                  1
                ]
              },
              null
            ]
          }
        }
      },

      // Stage 14: Add performance indicators
      {
        $addFields: {
          performanceIndicator: {
            $switch: {
              branches: [
                { 
                  case: { $and: [
                    { $gt: ['$monthOverMonthGrowth', 0.1] },
                    { $gt: ['$profitMarginPercent', 20] }
                  ]},
                  then: 'excellent'
                },
                { 
                  case: { $and: [
                    { $gt: ['$monthOverMonthGrowth', 0.05] },
                    { $gt: ['$profitMarginPercent', 15] }
                  ]},
                  then: 'good'
                },
                { 
                  case: { $or: [
                    { $lt: ['$monthOverMonthGrowth', -0.1] },
                    { $lt: ['$profitMarginPercent', 5] }
                  ]},
                  then: 'concerning'
                }
              ],
              default: 'average'
            }
          },

          // Business recommendations
          recommendation: {
            $switch: {
              branches: [
                { 
                  case: { $lt: ['$monthOverMonthGrowth', -0.2] },
                  then: 'Urgent: Investigate revenue decline and implement recovery strategy'
                },
                { 
                  case: { $lt: ['$profitMarginPercent', 5] },
                  then: 'Action: Review pricing strategy and cost structure'
                },
                { 
                  case: { $and: [
                    { $gt: ['$monthOverMonthGrowth', 0.15] },
                    { $gt: ['$revenuePercentile', 0.8] }
                  ]},
                  then: 'Opportunity: Scale successful strategies and increase investment'
                },
                { 
                  case: { $lt: ['$customerAcquisitionRate', 0.1] },
                  then: 'Focus: Improve customer acquisition and marketing effectiveness'
                }
              ],
              default: 'Monitor: Continue current strategies with minor optimizations'
            }
          }
        }
      },

      // Stage 15: Sort by strategic importance
      {
        $sort: {
          'totalRevenue': -1,
          'profitMarginPercent': -1,
          '_id.year': -1,
          '_id.month': -1
        }
      },

      // Stage 16: Limit to top performing segments for detailed analysis
      { $limit: 100 }
    ];

    const results = await this.collections.orders.aggregate(pipeline).toArray();

    console.log(`Sales analytics completed: ${results.length} segments analyzed`);
    return results;
  }

  async executeCustomerSegmentationPipeline() {
    console.log('Executing advanced customer segmentation pipeline...');

    const pipeline = [
      // Stage 1: Match active customers with orders
      {
        $match: {
          'account.status': 'active',
          'account.createdAt': { $gte: new Date(Date.now() - 730 * 24 * 60 * 60 * 1000) } // Last 2 years
        }
      },

      // Stage 2: Lookup customer orders
      {
        $lookup: {
          from: 'orders',
          localField: '_id',
          foreignField: 'customerId',
          as: 'orders',
          pipeline: [
            {
              $match: {
                status: 'completed',
                orderDate: { $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) }
              }
            },
            {
              $project: {
                orderDate: 1,
                'totals.total': 1,
                'totals.currency': 1,
                items: 1
              }
            }
          ]
        }
      },

      // Stage 3: Calculate RFM metrics (Recency, Frequency, Monetary)
      {
        $addFields: {
          // Recency: Days since last order
          recency: {
            $cond: [
              { $gt: [{ $size: '$orders' }, 0] },
              {
                $divide: [
                  { $subtract: [new Date(), { $max: '$orders.orderDate' }] },
                  1000 * 60 * 60 * 24 // Convert to days
                ]
              },
              999 // Default high recency for customers with no orders
            ]
          },

          // Frequency: Number of orders
          frequency: { $size: '$orders' },

          // Monetary: Total spending
          monetary: {
            $reduce: {
              input: '$orders',
              initialValue: 0,
              in: { $add: ['$$value', '$$this.totals.total'] }
            }
          },

          // Additional customer metrics
          avgOrderValue: {
            $cond: [
              { $gt: [{ $size: '$orders' }, 0] },
              {
                $divide: [
                  {
                    $reduce: {
                      input: '$orders',
                      initialValue: 0,
                      in: { $add: ['$$value', '$$this.totals.total'] }
                    }
                  },
                  { $size: '$orders' }
                ]
              },
              0
            ]
          },

          firstOrderDate: { $min: '$orders.orderDate' },
          lastOrderDate: { $max: '$orders.orderDate' },

          // Calculate customer lifetime (days)
          customerLifetime: {
            $cond: [
              { $gt: [{ $size: '$orders' }, 0] },
              {
                $divide: [
                  { $subtract: [{ $max: '$orders.orderDate' }, { $min: '$orders.orderDate' }] },
                  1000 * 60 * 60 * 24
                ]
              },
              0
            ]
          }
        }
      },

      // Stage 4: Calculate RFM scores using percentile ranking
      {
        $setWindowFields: {
          sortBy: { recency: 1 }, // Lower recency is better (more recent)
          output: {
            recencyScore: {
              $percentRank: {}
            }
          }
        }
      },

      {
        $setWindowFields: {
          sortBy: { frequency: -1 }, // Higher frequency is better
          output: {
            frequencyScore: {
              $percentRank: {}
            }
          }
        }
      },

      {
        $setWindowFields: {
          sortBy: { monetary: -1 }, // Higher monetary is better
          output: {
            monetaryScore: {
              $percentRank: {}
            }
          }
        }
      },

      // Stage 5: Create RFM segments
      {
        $addFields: {
          // Convert percentile scores to a 1-5 scale (5 = best for each dimension);
          // $percentRank assigns 0 to the best value under each sort above, so invert before bucketing
          recencyBucket: {
            $max: [1, { $ceil: { $multiply: [{ $subtract: [1, '$recencyScore'] }, 5] } }]
          },
          frequencyBucket: {
            $max: [1, { $ceil: { $multiply: [{ $subtract: [1, '$frequencyScore'] }, 5] } }]
          },
          monetaryBucket: {
            $max: [1, { $ceil: { $multiply: [{ $subtract: [1, '$monetaryScore'] }, 5] } }]
          }
        }
      },

      // Stage 6: Create customer segments based on RFM
      {
        $addFields: {
          rfmScore: {
            $concat: [
              { $toString: '$recencyBucket' },
              { $toString: '$frequencyBucket' },
              { $toString: '$monetaryBucket' }
            ]
          },

          customerSegment: {
            $switch: {
              branches: [
                // Champions: High value, bought recently, buy often
                { 
                  case: { $and: [
                    { $gte: ['$recencyBucket', 4] },
                    { $gte: ['$frequencyBucket', 4] },
                    { $gte: ['$monetaryBucket', 4] }
                  ]},
                  then: 'champions'
                },
                // Loyal customers: High frequency and monetary, but not recent
                { 
                  case: { $and: [
                    { $gte: ['$frequencyBucket', 4] },
                    { $gte: ['$monetaryBucket', 4] }
                  ]},
                  then: 'loyal_customers'
                },
                // Potential loyalists: Recent customers with good frequency
                { 
                  case: { $and: [
                    { $gte: ['$recencyBucket', 4] },
                    { $gte: ['$frequencyBucket', 3] }
                  ]},
                  then: 'potential_loyalists'
                },
                // New customers: Recent but low frequency/monetary
                { 
                  case: { $and: [
                    { $gte: ['$recencyBucket', 4] },
                    { $lte: ['$frequencyBucket', 2] }
                  ]},
                  then: 'new_customers'
                },
                // Promising: Recent moderate spenders
                { 
                  case: { $and: [
                    { $gte: ['$recencyBucket', 3] },
                    { $gte: ['$monetaryBucket', 3] }
                  ]},
                  then: 'promising'
                },
                // Need attention: Recent low spenders
                { 
                  case: { $and: [
                    { $gte: ['$recencyBucket', 3] },
                    { $lte: ['$monetaryBucket', 2] }
                  ]},
                  then: 'need_attention'
                },
                // About to sleep: Low recency but good historical value
                { 
                  case: { $and: [
                    { $lte: ['$recencyBucket', 2] },
                    { $gte: ['$monetaryBucket', 3] }
                  ]},
                  then: 'about_to_sleep'
                },
                // At risk: Low recency and frequency but good monetary
                { 
                  case: { $and: [
                    { $lte: ['$recencyBucket', 2] },
                    { $lte: ['$frequencyBucket', 2] },
                    { $gte: ['$monetaryBucket', 3] }
                  ]},
                  then: 'at_risk'
                },
                // Cannot lose: Very low recency but high monetary
                { 
                  case: { $and: [
                    { $eq: ['$recencyBucket', 1] },
                    { $gte: ['$monetaryBucket', 4] }
                  ]},
                  then: 'cannot_lose'
                },
                // Hibernating: Low across all dimensions
                { 
                  case: { $and: [
                    { $lte: ['$recencyBucket', 2] },
                    { $lte: ['$frequencyBucket', 2] },
                    { $lte: ['$monetaryBucket', 2] }
                  ]},
                  then: 'hibernating'
                }
              ],
              default: 'others'
            }
          },

          // Calculate customer lifetime value
          customerLifetimeValue: {
            $multiply: [
              '$avgOrderValue',
              { $divide: ['$frequency', { $max: [1, { $divide: ['$customerLifetime', 365] }] }] }, // Orders per year
              3 // Projected future years
            ]
          },

          // Churn risk assessment
          churnRisk: {
            $switch: {
              branches: [
                { case: { $gte: ['$recency', 180] }, then: 'high' },
                { case: { $gte: ['$recency', 90] }, then: 'medium' },
                { case: { $gte: ['$recency', 30] }, then: 'low' }
              ],
              default: 'very_low'
            }
          }
        }
      },

      // Stage 7: Enrich with customer profile data
      {
        $addFields: {
          profileCompleteness: {
            $divide: [
              {
                $add: [
                  { $cond: [{ $ne: ['$profile.firstName', null] }, 1, 0] },
                  { $cond: [{ $ne: ['$profile.lastName', null] }, 1, 0] },
                  { $cond: [{ $ne: ['$profile.phone', null] }, 1, 0] },
                  { $cond: [{ $ne: ['$profile.location', null] }, 1, 0] },
                  { $cond: [{ $ne: ['$profile.dateOfBirth', null] }, 1, 0] },
                  { $cond: [{ $ne: ['$preferences', null] }, 1, 0] }
                ]
              },
              6
            ]
          },

          engagementLevel: {
            $switch: {
              branches: [
                { 
                  case: { $and: [
                    { $gte: ['$frequency', 10] },
                    { $lte: ['$recency', 30] }
                  ]},
                  then: 'highly_engaged'
                },
                { 
                  case: { $and: [
                    { $gte: ['$frequency', 5] },
                    { $lte: ['$recency', 60] }
                  ]},
                  then: 'moderately_engaged'
                },
                { 
                  case: { $and: [
                    { $gte: ['$frequency', 2] },
                    { $lte: ['$recency', 120] }
                  ]},
                  then: 'lightly_engaged'
                }
              ],
              default: 'disengaged'
            }
          }
        }
      },

      // Stage 8: Create final customer analysis
      {
        $project: {
          _id: 1,
          email: 1,
          'profile.firstName': 1,
          'profile.lastName': 1,
          'profile.location.country': 1,
          'profile.location.region': 1,
          'account.type': 1,
          'account.createdAt': 1,

          // RFM Analysis
          recency: { $round: ['$recency', 1] },
          frequency: 1,
          monetary: { $round: ['$monetary', 2] },
          rfmScore: 1,
          recencyBucket: 1,
          frequencyBucket: 1,
          monetaryBucket: 1,

          // Customer Classification
          customerSegment: 1,
          churnRisk: 1,
          engagementLevel: 1,

          // Business Metrics
          avgOrderValue: { $round: ['$avgOrderValue', 2] },
          customerLifetimeValue: { $round: ['$customerLifetimeValue', 2] },
          customerLifetime: { $round: ['$customerLifetime', 0] },
          profileCompleteness: { $round: [{ $multiply: ['$profileCompleteness', 100] }, 1] },

          // Timeline
          firstOrderDate: 1,
          lastOrderDate: 1,

          // Marketing recommendations
          marketingAction: {
            $switch: {
              branches: [
                { case: { $eq: ['$customerSegment', 'champions'] }, then: 'Reward and advocate program' },
                { case: { $eq: ['$customerSegment', 'loyal_customers'] }, then: 'Upsell and cross-sell premium products' },
                { case: { $eq: ['$customerSegment', 'potential_loyalists'] }, then: 'Loyalty program enrollment' },
                { case: { $eq: ['$customerSegment', 'new_customers'] }, then: 'Onboarding and education campaign' },
                { case: { $eq: ['$customerSegment', 'promising'] }, then: 'Targeted promotions and engagement' },
                { case: { $eq: ['$customerSegment', 'need_attention'] }, then: 'Value demonstration and support' },
                { case: { $eq: ['$customerSegment', 'about_to_sleep'] }, then: 'Re-engagement campaign with incentives' },
                { case: { $eq: ['$customerSegment', 'at_risk'] }, then: 'Urgent retention program' },
                { case: { $eq: ['$customerSegment', 'cannot_lose'] }, then: 'Win-back campaign with premium offers' },
                { case: { $eq: ['$customerSegment', 'hibernating'] }, then: 'Reactivation with significant discount' }
              ],
              default: 'Monitor and nurture'
            }
          }
        }
      },

      // Stage 9: Sort by customer value and risk
      {
        $sort: {
          customerLifetimeValue: -1,
          recency: 1,
          frequency: -1
        }
      }
    ];

    const results = await this.collections.customers.aggregate(pipeline).toArray();

    console.log(`Customer segmentation completed: ${results.length} customers analyzed`);
    return results;
  }

  async executeProductPerformancePipeline() {
    console.log('Executing product performance analysis pipeline...');

    const pipeline = [
      // Stage 1: Match products with sales data
      {
        $lookup: {
          from: 'orders',
          let: { productId: '$_id' },
          pipeline: [
            {
              $match: {
                status: 'completed',
                orderDate: { $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) }
              }
            },
            { $unwind: '$items' },
            {
              $match: {
                $expr: { $eq: ['$items.productId', '$$productId'] }
              }
            },
            {
              $project: {
                orderDate: 1,
                customerId: 1,
                quantity: '$items.quantity',
                unitPrice: '$items.unitPrice',
                totalPrice: '$items.totalPrice',
                'customer.location.country': 1
              }
            }
          ],
          as: 'sales'
        }
      },

      // Stage 2: Calculate comprehensive product metrics
      {
        $addFields: {
          // Sales volume metrics
          totalUnitsSold: {
            $reduce: {
              input: '$sales',
              initialValue: 0,
              in: { $add: ['$$value', '$$this.quantity'] }
            }
          },

          totalRevenue: {
            $reduce: {
              input: '$sales',
              initialValue: 0,
              in: { $add: ['$$value', '$$this.totalPrice'] }
            }
          },

          totalOrders: { $size: '$sales' },
          uniqueCustomers: { $size: { $setUnion: [{ $map: { input: '$sales', as: 'sale', in: '$$sale.customerId' } }, []] } },

          // Pricing analysis
          avgSellingPrice: {
            $cond: [
              { $gt: [{ $size: '$sales' }, 0] },
              {
                $divide: [
                  {
                    $reduce: {
                      input: '$sales',
                      initialValue: 0,
                      in: { $add: ['$$value', '$$this.unitPrice'] }
                    }
                  },
                  { $size: '$sales' }
                ]
              },
              0
            ]
          },

          // Profit analysis
          totalProfit: {
            $reduce: {
              input: '$sales',
              initialValue: 0,
              in: { 
                $add: [
                  '$$value', 
                  { 
                    $multiply: [
                      '$$this.quantity',
                      { $subtract: ['$$this.unitPrice', '$pricing.cost'] }
                    ]
                  }
                ]
              }
            }
          },

          // Time-based analysis
          firstSaleDate: { $min: '$sales.orderDate' },
          lastSaleDate: { $max: '$sales.orderDate' },

          // Calculate monthly sales trend
          monthlySales: {
            $map: {
              input: { $range: [0, 12] },
              as: 'monthOffset',
              in: {
                month: {
                  $dateFromParts: {
                    year: { $year: { $dateSubtract: { startDate: new Date(), unit: 'month', amount: '$$monthOffset' } } },
                    month: { $month: { $dateSubtract: { startDate: new Date(), unit: 'month', amount: '$$monthOffset' } } },
                    day: 1
                  }
                },
                sales: {
                  $reduce: {
                    input: {
                      $filter: {
                        input: '$sales',
                        cond: {
                          $and: [
                            { $gte: ['$$this.orderDate', { $dateSubtract: { startDate: new Date(), unit: 'month', amount: { $add: ['$$monthOffset', 1] } } }] },
                            { $lt: ['$$this.orderDate', { $dateSubtract: { startDate: new Date(), unit: 'month', amount: '$$monthOffset' } }] }
                          ]
                        }
                      }
                    },
                    initialValue: 0,
                    in: { $add: ['$$value', '$$this.totalPrice'] }
                  }
                }
              }
            }
          }
        }
      },

      // Stage 3: Calculate performance indicators
      {
        $addFields: {
          // Performance ratios
          profitMargin: {
            $cond: [
              { $gt: ['$totalRevenue', 0] },
              { $divide: ['$totalProfit', '$totalRevenue'] },
              0
            ]
          },

          revenuePerCustomer: {
            $cond: [
              { $gt: ['$uniqueCustomers', 0] },
              { $divide: ['$totalRevenue', '$uniqueCustomers'] },
              0
            ]
          },

          avgOrderValue: {
            $cond: [
              { $gt: ['$totalOrders', 0] },
              { $divide: ['$totalRevenue', '$totalOrders'] },
              0
            ]
          },

          // Inventory turnover (simplified)
          inventoryTurnover: {
            $cond: [
              { $gt: ['$inventory.quantity', 0] },
              { $divide: ['$totalUnitsSold', '$inventory.quantity'] },
              0
            ]
          },

          // Product lifecycle stage
          lifecycleStage: {
            $switch: {
              branches: [
                { 
                  case: { 
                    $gte: [
                      '$firstSaleDate', 
                      { $dateSubtract: { startDate: new Date(), unit: 'day', amount: 90 } }
                    ]
                  },
                  then: 'new'
                },
                {
                  case: { $and: [
                    { $gt: ['$totalRevenue', 10000] },
                    { $gt: ['$profitMargin', 0.2] }
                  ]},
                  then: 'growth'
                },
                {
                  case: { $and: [
                    { $gt: ['$totalRevenue', 50000] },
                    { $gte: ['$profitMargin', 0.15] }
                  ]},
                  then: 'maturity'
                },
                {
                  case: { $or: [
                    { $lt: ['$profitMargin', 0.1] },
                    { $lt: [
                      '$lastSaleDate',
                      { $dateSubtract: { startDate: new Date(), unit: 'day', amount: 60 } }
                    ]}
                  ]},
                  then: 'decline'
                }
              ],
              default: 'development'
            }
          },

          // Sales trend analysis
          salesTrend: {
            $let: {
              vars: {
                recentSales: { $slice: ['$monthlySales.sales', 0, 6] },
                olderSales: { $slice: ['$monthlySales.sales', 6, 6] }
              },
              in: {
                $cond: [
                  { $and: [
                    { $gt: [{ $avg: '$$recentSales' }, { $avg: '$$olderSales' }] },
                    { $gt: [{ $avg: '$$recentSales' }, 0] }
                  ]},
                  'growing',
                  {
                    $cond: [
                      { $lt: [{ $avg: '$$recentSales' }, { $multiply: [{ $avg: '$$olderSales' }, 0.8] }] },
                      'declining',
                      'stable'
                    ]
                  }
                ]
              }
            }
          }
        }
      },

      // Stage 4: Add competitive analysis using window functions
      {
        $setWindowFields: {
          partitionBy: '$category',
          sortBy: { totalRevenue: -1 },
          output: {
            categoryRank: { $rank: {} },
            categoryPercentile: { $percentRank: {} },
            marketShareInCategory: {
              $divide: [
                '$totalRevenue',
                { $sum: '$totalRevenue', window: { documents: ['unbounded', 'unbounded'] } }
              ]
            }
          }
        }
      },

      // Stage 5: Calculate final performance scores
      {
        $addFields: {
          // Overall performance score (0-100)
          performanceScore: {
            $multiply: [
              {
                $add: [
                  { $multiply: ['$categoryPercentile', 0.3] }, // Market position
                  { $multiply: [{ $min: ['$profitMargin', 0.5] }, 0.25] }, // Profitability (capped at 50%)
                  { $multiply: [{ $divide: [{ $min: ['$inventoryTurnover', 10] }, 10] }, 0.2] }, // Efficiency (capped at 10x)
                  { 
                    $multiply: [
                      {
                        $switch: {
                          branches: [
                            { case: { $eq: ['$salesTrend', 'growing'] }, then: 1 },
                            { case: { $eq: ['$salesTrend', 'stable'] }, then: 0.7 },
                            { case: { $eq: ['$salesTrend', 'declining'] }, then: 0.3 }
                          ],
                          default: 0.5
                        }
                      },
                      0.25
                    ]
                  } // Growth trend
                ]
              },
              100
            ]
          },

          // Strategic recommendations
          strategicRecommendation: {
            $switch: {
              branches: [
                {
                  case: { $and: [
                    { $eq: ['$salesTrend', 'growing'] },
                    { $gt: ['$profitMargin', 0.25] },
                    { $lt: ['$categoryRank', 5] }
                  ]},
                  then: 'Star Product: Increase investment and marketing focus'
                },
                {
                  case: { $and: [
                    { $eq: ['$lifecycleStage', 'maturity'] },
                    { $gt: ['$profitMargin', 0.2] }
                  ]},
                  then: 'Cash Cow: Optimize operations and maintain market share'
                },
                {
                  case: { $and: [
                    { $eq: ['$salesTrend', 'growing'] },
                    { $lt: ['$profitMargin', 0.15] }
                  ]},
                  then: 'Question Mark: Improve margins or consider repositioning'
                },
                {
                  case: { $and: [
                    { $eq: ['$salesTrend', 'declining'] },
                    { $lt: ['$profitMargin', 0.1] }
                  ]},
                  then: 'Dog: Consider discontinuation or major repositioning'
                },
                {
                  case: { $eq: ['$lifecycleStage', 'new'] },
                  then: 'Monitor closely and provide marketing support'
                }
              ],
              default: 'Maintain current strategy with regular monitoring'
            }
          }
        }
      },

      // Stage 6: Final projection and sorting
      {
        $project: {
          _id: 1,
          name: 1,
          category: 1,
          brand: 1,
          'pricing.cost': 1,
          'pricing.retail': 1,

          // Sales performance
          totalUnitsSold: 1,
          totalRevenue: { $round: ['$totalRevenue', 2] },
          totalProfit: { $round: ['$totalProfit', 2] },
          totalOrders: 1,
          uniqueCustomers: 1,

          // Financial metrics
          avgSellingPrice: { $round: ['$avgSellingPrice', 2] },
          profitMargin: { $round: [{ $multiply: ['$profitMargin', 100] }, 2] },
          revenuePerCustomer: { $round: ['$revenuePerCustomer', 2] },
          avgOrderValue: { $round: ['$avgOrderValue', 2] },

          // Performance indicators
          performanceScore: { $round: ['$performanceScore', 1] },
          lifecycleStage: 1,
          salesTrend: 1,
          categoryRank: 1,
          marketShareInCategory: { $round: [{ $multiply: ['$marketShareInCategory', 100] }, 3] },

          // Operational metrics
          inventoryTurnover: { $round: ['$inventoryTurnover', 2] },
          'inventory.quantity': 1,
          'inventory.lowStockThreshold': 1,

          // Timeline
          firstSaleDate: 1,
          lastSaleDate: 1,

          // Strategic guidance
          strategicRecommendation: 1,

          // Monthly trend data (last 6 months)
          recentMonthlySales: { $slice: ['$monthlySales', 0, 6] }
        }
      },

      // Stage 7: Sort by performance score and revenue
      {
        $sort: {
          performanceScore: -1,
          totalRevenue: -1
        }
      }
    ];

    const results = await this.collections.products.aggregate(pipeline).toArray();

    console.log(`Product performance analysis completed: ${results.length} products analyzed`);
    return results;
  }

  async executeTimeSeriesAnalytics() {
    console.log('Executing time-series analytics with advanced forecasting...');

    const pipeline = [
      // Stage 1: Match recent orders for time-series analysis
      {
        $match: {
          status: 'completed',
          orderDate: { $gte: new Date(Date.now() - 730 * 24 * 60 * 60 * 1000) } // Last 2 years
        }
      },

      // Stage 2: Group by time periods
      {
        $group: {
          _id: {
            year: { $year: '$orderDate' },
            month: { $month: '$orderDate' },
            week: { $week: '$orderDate' },
            dayOfWeek: { $dayOfWeek: '$orderDate' },
            hour: { $hour: '$orderDate' }
          },

          // Core metrics
          orderCount: { $sum: 1 },
          totalRevenue: { $sum: '$totals.total' },
          avgOrderValue: { $avg: '$totals.total' },
          uniqueCustomers: { $addToSet: '$customerId' },

          // Item-level aggregations
          totalItemsSold: {
            $sum: {
              $reduce: {
                input: '$items',
                initialValue: 0,
                in: { $add: ['$$value', '$$this.quantity'] }
              }
            }
          },

          // Distribution analysis
          orderValues: { $push: '$totals.total' },

          // Customer behavior (assumes the customer's account.createdAt is embedded on the order document)
          newCustomers: {
            $sum: {
              $cond: [
                { $eq: [{ $dayOfYear: '$orderDate' }, { $dayOfYear: '$customer.account.createdAt' }] },
                1,
                0
              ]
            }
          },

          // Geographic distribution
          countries: { $addToSet: '$shippingAddress.country' },
          regions: { $addToSet: '$shippingAddress.region' }
        }
      },

      // Stage 3: Add time-based calculations
      {
        $addFields: {
          // Convert _id to more usable date format
          date: {
            $dateFromParts: {
              year: '$_id.year',
              month: '$_id.month',
              day: 1
            }
          },

          uniqueCustomerCount: { $size: '$uniqueCustomers' },
          uniqueCountryCount: { $size: '$countries' },

          // Statistical measures
          revenueStdDev: { $stdDevPop: '$orderValues' },
          medianOrderValue: {
            $let: {
              vars: {
                sortedValues: {
                  $sortArray: {
                    input: '$orderValues',
                    sortBy: 1
                  }
                }
              },
              in: {
                $arrayElemAt: [
                  '$$sortedValues',
                  { $floor: { $divide: [{ $size: '$$sortedValues' }, 2] } }
                ]
              }
            }
          },

          // Time period classifications ($dayOfWeek returns 1 = Sunday ... 7 = Saturday)
          periodType: {
            $switch: {
              branches: [
                { case: { $in: ['$_id.dayOfWeek', [1, 7]] }, then: 'weekend' }
              ],
              default: 'weekday'
            }
          },

          timeOfDay: {
            $switch: {
              branches: [
                { case: { $lt: ['$_id.hour', 6] }, then: 'late_night' },
                { case: { $lt: ['$_id.hour', 12] }, then: 'morning' },
                { case: { $lt: ['$_id.hour', 18] }, then: 'afternoon' },
                { case: { $lt: ['$_id.hour', 22] }, then: 'evening' }
              ],
              default: 'night'
            }
          }
        }
      },

      // Stage 4: Add moving averages and trend analysis
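      // (Requires MongoDB 5.0+ for $setWindowFields; the $linearFill operator below requires 5.3+)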
      {
        $setWindowFields: {
          partitionBy: null,
          sortBy: { date: 1 },
          output: {
            // Moving averages
            revenue7DayMA: {
              $avg: '$totalRevenue',
              window: { documents: [-6, 0] }
            },
            revenue30DayMA: {
              $avg: '$totalRevenue',
              window: { documents: [-29, 0] }
            },

            // Growth calculations
            previousDayRevenue: {
              $shift: { output: '$totalRevenue', by: -1 }
            },
            previousWeekRevenue: {
              $shift: { output: '$totalRevenue', by: -7 }
            },
            previousMonthRevenue: {
              $shift: { output: '$totalRevenue', by: -30 }
            },

            // Volatility measures
            revenueVolatility: {
              $stdDevPop: '$totalRevenue',
              window: { documents: [-29, 0] }
            },

            // Gap filling: interpolate missing revenue values along the sorted date axis
            trendLine: {
              $linearFill: '$totalRevenue'
            }
          }
        }
      },

      // Stage 5: Calculate growth rates and trend indicators
      {
        $addFields: {
          dayOverDayGrowth: {
            $cond: [
              { $gt: ['$previousDayRevenue', 0] },
              { $subtract: [{ $divide: ['$totalRevenue', '$previousDayRevenue'] }, 1] },
              null
            ]
          },

          weekOverWeekGrowth: {
            $cond: [
              { $gt: ['$previousWeekRevenue', 0] },
              { $subtract: [{ $divide: ['$totalRevenue', '$previousWeekRevenue'] }, 1] },
              null
            ]
          },

          monthOverMonthGrowth: {
            $cond: [
              { $gt: ['$previousMonthRevenue', 0] },
              { $subtract: [{ $divide: ['$totalRevenue', '$previousMonthRevenue'] }, 1] },
              null
            ]
          },

          // Trend classification
          trendDirection: {
            $switch: {
              branches: [
                { 
                  case: { $gt: ['$revenue7DayMA', { $multiply: ['$revenue30DayMA', 1.05] }] },
                  then: 'strong_upward'
                },
                { 
                  case: { $gt: ['$revenue7DayMA', { $multiply: ['$revenue30DayMA', 1.02] }] },
                  then: 'upward'
                },
                { 
                  case: { $lt: ['$revenue7DayMA', { $multiply: ['$revenue30DayMA', 0.95] }] },
                  then: 'strong_downward'
                },
                { 
                  case: { $lt: ['$revenue7DayMA', { $multiply: ['$revenue30DayMA', 0.98] }] },
                  then: 'downward'
                }
              ],
              default: 'stable'
            }
          },

          // Seasonality detection
          seasonalityScore: {
            $divide: [
              '$revenueStdDev',
              { $max: ['$revenue30DayMA', 1] }
            ]
          },

          // Performance classification
          performanceCategory: {
            $switch: {
              branches: [
                { 
                  case: { $gte: ['$totalRevenue', { $multiply: ['$revenue30DayMA', 1.2] }] },
                  then: 'exceptional'
                },
                { 
                  case: { $gte: ['$totalRevenue', { $multiply: ['$revenue30DayMA', 1.1] }] },
                  then: 'above_average'
                },
                { 
                  case: { $lte: ['$totalRevenue', { $multiply: ['$revenue30DayMA', 0.8] }] },
                  then: 'poor'
                },
                { 
                  case: { $lte: ['$totalRevenue', { $multiply: ['$revenue30DayMA', 0.9] }] },
                  then: 'below_average'
                }
              ],
              default: 'average'
            }
          }
        }
      },

      // Stage 6: Add forecasting indicators
      {
        $addFields: {
          // Simple linear trend projection over the next 7 periods.
          // Note: $shift is only valid inside $setWindowFields, so the recent slope is
          // approximated here by the gap between the 7-period and 30-period moving averages.
          next7DayForecast: {
            $add: [
              '$revenue7DayMA',
              {
                $multiply: [
                  7,
                  { $subtract: ['$revenue7DayMA', '$revenue30DayMA'] }
                ]
              }
            ]
          },

          // Confidence indicator for the forecast (guarded against division by zero)
          forecastConfidence: {
            $subtract: [
              100,
              { $multiply: [{ $divide: ['$revenueVolatility', { $max: ['$revenue7DayMA', 1] }] }, 100] }
            ]
          },

          // Anomaly detection
          isAnomaly: {
            $or: [
              { $gt: ['$totalRevenue', { $add: ['$revenue7DayMA', { $multiply: ['$revenueVolatility', 2] }] }] },
              { $lt: ['$totalRevenue', { $subtract: ['$revenue7DayMA', { $multiply: ['$revenueVolatility', 2] }] }] }
            ]
          },

          // Business recommendations
          recommendation: {
            $switch: {
              branches: [
                {
                  case: { $eq: ['$trendDirection', 'strong_downward'] },
                  then: 'Urgent: Investigate revenue decline and implement recovery strategies'
                },
                {
                  case: { $eq: ['$trendDirection', 'strong_upward'] },
                  then: 'Opportunity: Scale successful initiatives and increase capacity'
                },
                {
                  case: { $gt: ['$seasonalityScore', 0.5] },
                  then: 'High volatility detected: Implement demand smoothing strategies'
                },
                {
                  case: { $eq: ['$performanceCategory', 'exceptional'] },
                  then: 'Analyze success factors for replication'
                }
              ],
              default: 'Continue monitoring with current strategy'
            }
          }
        }
      },

      // Stage 7: Final projection and filtering
      {
        $project: {
          date: 1,
          year: '$_id.year',
          month: '$_id.month',
          week: '$_id.week',
          dayOfWeek: '$_id.dayOfWeek',
          hour: '$_id.hour',

          // Core metrics
          orderCount: 1,
          totalRevenue: { $round: ['$totalRevenue', 2] },
          avgOrderValue: { $round: ['$avgOrderValue', 2] },
          uniqueCustomerCount: 1,
          totalItemsSold: 1,

          // Statistical measures
          medianOrderValue: { $round: ['$medianOrderValue', 2] },
          revenueStdDev: { $round: ['$revenueStdDev', 2] },

          // Trend analysis
          revenue7DayMA: { $round: ['$revenue7DayMA', 2] },
          revenue30DayMA: { $round: ['$revenue30DayMA', 2] },
          dayOverDayGrowth: { $round: [{ $multiply: ['$dayOverDayGrowth', 100] }, 2] },
          weekOverWeekGrowth: { $round: [{ $multiply: ['$weekOverWeekGrowth', 100] }, 2] },
          monthOverMonthGrowth: { $round: [{ $multiply: ['$monthOverMonthGrowth', 100] }, 2] },

          trendDirection: 1,
          performanceCategory: 1,
          seasonalityScore: { $round: ['$seasonalityScore', 3] },

          // Forecasting
          next7DayForecast: { $round: ['$next7DayForecast', 2] },
          forecastConfidence: { $round: ['$forecastConfidence', 1] },
          isAnomaly: 1,

          // Context
          periodType: 1,
          timeOfDay: 1,
          uniqueCountryCount: 1,

          // Business intelligence
          recommendation: 1
        }
      },

      // Stage 8: Sort by date descending
      {
        $sort: { date: -1 }
      },

      // Stage 9: Limit to recent data for performance
      {
        $limit: 365 // Most recent 365 periods (roughly one year of data)
      }
    ];

    const results = await this.collections.orders.aggregate(pipeline).toArray();

    console.log(`Time-series analytics completed: ${results.length} time periods analyzed`);
    return results;
  }

  async executePredictiveAnalytics() {
    console.log('Executing predictive analytics and machine learning insights...');

    const pipeline = [
      // Stage 1: Create customer behavioral features
      {
        $match: {
          'account.status': 'active'
        }
      },

      // Stage 2: Lookup order history
      {
        $lookup: {
          from: 'orders',
          localField: '_id',
          foreignField: 'customerId',
          as: 'orders',
          pipeline: [
            {
              $match: {
                status: 'completed',
                orderDate: { $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) }
              }
            },
            {
              $project: {
                orderDate: 1,
                'totals.total': 1,
                daysSinceRegistration: {
                  $divide: [
                    { $subtract: ['$orderDate', '$customer.account.createdAt'] },
                    1000 * 60 * 60 * 24
                  ]
                }
              }
            },
            { $sort: { orderDate: 1 } }
          ]
        }
      },

      // Stage 3: Calculate predictive features
      {
        $addFields: {
          // Temporal features
          daysSinceRegistration: {
            $divide: [
              { $subtract: [new Date(), '$account.createdAt'] },
              1000 * 60 * 60 * 24
            ]
          },

          daysSinceLastOrder: {
            $cond: [
              { $gt: [{ $size: '$orders' }, 0] },
              {
                $divide: [
                  { $subtract: [new Date(), { $max: '$orders.orderDate' }] },
                  1000 * 60 * 60 * 24
                ]
              },
              999
            ]
          },

          // Purchase behavior features
          totalOrders: { $size: '$orders' },
          totalSpent: {
            $reduce: {
              input: '$orders',
              initialValue: 0,
              in: { $add: ['$$value', '$$this.totals.total'] }
            }
          },

          // Purchase frequency and regularity
          avgDaysBetweenOrders: {
            $cond: [
              { $gt: [{ $size: '$orders' }, 1] },
              {
                $divide: [
                  {
                    $divide: [
                      { $subtract: [{ $max: '$orders.orderDate' }, { $min: '$orders.orderDate' }] },
                      1000 * 60 * 60 * 24
                    ]
                  },
                  { $subtract: [{ $size: '$orders' }, 1] }
                ]
              },
              null
            ]
          },

          // Purchase pattern analysis
          orderFrequencyTrend: {
            $let: {
              vars: {
                recentOrders: {
                  $size: {
                    $filter: {
                      input: '$orders',
                      cond: { $gte: ['$$this.orderDate', { $dateSubtract: { startDate: new Date(), unit: 'day', amount: 90 } }] }
                    }
                  }
                },
                olderOrders: {
                  $size: {
                    $filter: {
                      input: '$orders',
                      cond: { 
                        $and: [
                          { $lt: ['$$this.orderDate', { $dateSubtract: { startDate: new Date(), unit: 'day', amount: 90 } }] },
                          { $gte: ['$$this.orderDate', { $dateSubtract: { startDate: new Date(), unit: 'day', amount: 180 } }] }
                        ]
                      }
                    }
                  }
                }
              },
              in: {
                $cond: [
                  { $gt: ['$$olderOrders', 0] },
                  { $subtract: [{ $divide: ['$$recentOrders', 90] }, { $divide: ['$$olderOrders', 90] }] },
                  0
                ]
              }
            }
          }
        }
      },

      // Stage 4: Calculate churn probability using logistic regression approximation
      {
        $addFields: {
          // Feature normalization and scoring
          recencyScore: {
            $cond: [
              { $gt: ['$daysSinceLastOrder', 180] }, 0.8,
              { $cond: [
                { $gt: ['$daysSinceLastOrder', 90] }, 0.6,
                { $cond: [
                  { $gt: ['$daysSinceLastOrder', 30] }, 0.3,
                  0.1
                ]}
              ]}
            ]
          },

          frequencyScore: {
            $cond: [
              { $lt: ['$totalOrders', 2] }, 0.7,
              { $cond: [
                { $lt: ['$totalOrders', 5] }, 0.5,
                { $cond: [
                  { $lt: ['$totalOrders', 10] }, 0.3,
                  0.1
                ]}
              ]}
            ]
          },

          monetaryScore: {
            $cond: [
              { $lt: ['$totalSpent', 100] }, 0.6,
              { $cond: [
                { $lt: ['$totalSpent', 500] }, 0.4,
                { $cond: [
                  { $lt: ['$totalSpent', 1000] }, 0.2,
                  0.1
                ]}
              ]}
            ]
          },

          engagementScore: {
            $cond: [
              { $lt: ['$orderFrequencyTrend', -0.5] }, 0.8,
              { $cond: [
                { $lt: ['$orderFrequencyTrend', 0] }, 0.6,
                { $cond: [
                  { $gt: ['$orderFrequencyTrend', 0.5] }, 0.1,
                  0.3
                ]}
              ]}
            ]
          }
        }
      },

      // Stage 5: Calculate composite churn probability
      {
        $addFields: {
          churnProbability: {
            $multiply: [
              {
                $add: [
                  { $multiply: ['$recencyScore', 0.35] },
                  { $multiply: ['$frequencyScore', 0.25] },
                  { $multiply: ['$monetaryScore', 0.25] },
                  { $multiply: ['$engagementScore', 0.15] }
                ]
              },
              100
            ]
          }
        }
      },

      // Stage 5b: Derive value and timing predictions from the churn probability
      // (kept in a separate $addFields stage because fields added in one stage
      //  cannot be referenced by other expressions within that same stage)
      {
        $addFields: {
          // Customer lifetime value prediction
          predictedLifetimeValue: {
            $cond: [
              { $and: [
                { $gt: ['$totalOrders', 0] },
                { $gt: ['$avgDaysBetweenOrders', 0] }
              ]},
              {
                $multiply: [
                  { $divide: ['$totalSpent', '$totalOrders'] }, // Average order value
                  { $divide: [365, '$avgDaysBetweenOrders'] }, // Orders per year
                  { $subtract: [5, { $multiply: ['$churnProbability', 0.05] }] } // Expected years (adjusted for churn risk)
                ]
              },
              '$totalSpent'
            ]
          },

          // Next purchase prediction
          nextPurchasePrediction: {
            $cond: [
              { $gt: ['$avgDaysBetweenOrders', 0] },
              {
                $dateAdd: {
                  startDate: { $max: '$orders.orderDate' },
                  unit: 'day',
                  amount: { 
                    $multiply: [
                      '$avgDaysBetweenOrders',
                      { $add: [1, { $multiply: ['$churnProbability', 0.01] }] } // Adjust for churn risk
                    ]
                  }
                }
              },
              null
            ]
          },

          // Upselling opportunity score
          upsellOpportunity: {
            $multiply: [
              {
                $add: [
                  { $cond: [{ $gt: ['$totalOrders', 5] }, 0.3, 0] },
                  { $cond: [{ $gt: ['$totalSpent', 500] }, 0.3, 0] },
                  { $cond: [{ $lt: ['$daysSinceLastOrder', 30] }, 0.25, 0] },
                  { $cond: [{ $gt: ['$orderFrequencyTrend', 0] }, 0.15, 0] }
                ]
              },
              100
            ]
          }
        }
      },

      // Stage 6: Risk segmentation and recommendations
      {
        $addFields: {
          riskSegment: {
            $switch: {
              branches: [
                { case: { $gte: ['$churnProbability', 70] }, then: 'high_risk' },
                { case: { $gte: ['$churnProbability', 50] }, then: 'medium_risk' },
                { case: { $gte: ['$churnProbability', 30] }, then: 'low_risk' }
              ],
              default: 'stable'
            }
          },

          valueSegment: {
            $switch: {
              branches: [
                { case: { $gte: ['$predictedLifetimeValue', 2000] }, then: 'high_value' },
                { case: { $gte: ['$predictedLifetimeValue', 1000] }, then: 'medium_value' },
                { case: { $gte: ['$predictedLifetimeValue', 500] }, then: 'moderate_value' }
              ],
              default: 'low_value'
            }
          },

          // AI-driven marketing recommendations
          marketingRecommendation: {
            $switch: {
              branches: [
                {
                  case: { $and: [
                    { $eq: ['$riskSegment', 'high_risk'] },
                    { $in: ['$valueSegment', ['high_value', 'medium_value']] }
                  ]},
                  then: 'Urgent win-back campaign with premium incentives'
                },
                {
                  case: { $and: [
                    { $eq: ['$riskSegment', 'medium_risk'] },
                    { $gte: ['$upsellOpportunity', 60] }
                  ]},
                  then: 'Proactive engagement with upselling opportunities'
                },
                {
                  case: { $and: [
                    { $eq: ['$riskSegment', 'stable'] },
                    { $gte: ['$upsellOpportunity', 70] }
                  ]},
                  then: 'Cross-sell and premium product recommendations'
                },
                {
                  case: { $eq: ['$riskSegment', 'low_risk'] },
                  then: 'Retention campaign with loyalty program enrollment'
                }
              ],
              default: 'Monitor and maintain current engagement level'
            }
          }
        }
      },

      // Stage 7: Add market basket analysis
      {
        $lookup: {
          from: 'orders',
          let: { customerId: '$_id' },
          pipeline: [
            {
              $match: {
                $expr: { $eq: ['$customerId', '$$customerId'] },
                status: 'completed'
              }
            },
            { $unwind: '$items' },
            {
              $group: {
                _id: '$items.productId',
                purchaseCount: { $sum: 1 },
                totalQuantity: { $sum: '$items.quantity' },
                totalSpent: { $sum: '$items.totalPrice' }
              }
            },
            { $sort: { purchaseCount: -1 } },
            { $limit: 5 }
          ],
          as: 'topProducts'
        }
      },

      // Stage 8: Final projection
      {
        $project: {
          _id: 1,
          email: 1,
          'profile.firstName': 1,
          'profile.lastName': 1,
          'account.type': 1,
          'account.createdAt': 1,

          // Behavioral metrics
          daysSinceRegistration: { $round: ['$daysSinceRegistration', 0] },
          daysSinceLastOrder: { $round: ['$daysSinceLastOrder', 0] },
          totalOrders: 1,
          totalSpent: { $round: ['$totalSpent', 2] },
          avgDaysBetweenOrders: { $round: ['$avgDaysBetweenOrders', 1] },

          // Predictive scores
          churnProbability: { $round: ['$churnProbability', 1] },
          predictedLifetimeValue: { $round: ['$predictedLifetimeValue', 2] },
          upsellOpportunity: { $round: ['$upsellOpportunity', 1] },

          // Segmentation
          riskSegment: 1,
          valueSegment: 1,

          // Predictions
          nextPurchasePrediction: 1,
          marketingRecommendation: 1,

          // Product affinity
          topProducts: 1,

          // Trend analysis
          orderFrequencyTrend: { $round: ['$orderFrequencyTrend', 3] }
        }
      },

      // Stage 9: Sort by strategic importance
      {
        $sort: {
          predictedLifetimeValue: -1,
          churnProbability: -1,
          upsellOpportunity: -1
        }
      },

      // Stage 10: Limit to top opportunities
      {
        $limit: 1000
      }
    ];

    const results = await this.collections.customers.aggregate(pipeline).toArray();

    console.log(`Predictive analytics completed: ${results.length} customers analyzed with ML insights`);
    return results;
  }

  async cacheAnalyticsResults(analysisType, data) {
    console.log(`Caching ${analysisType} analytics results...`);

    try {
      await this.collections.analytics.replaceOne(
        { type: analysisType },
        {
          type: analysisType,
          data: data,
          generatedAt: new Date(),
          expiresAt: new Date(Date.now() + 24 * 60 * 60 * 1000) // 24 hour TTL
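          // Assumption: automatic cleanup of expired cache entries also requires a TTL index
          // on expiresAt, created once during setup, e.g.:
          //   collection.createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 })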
        },
        { upsert: true }
      );

    } catch (error) {
      console.warn('Failed to cache analytics results:', error.message);
    }
  }

  async getAnalyticsDashboard() {
    console.log('Generating comprehensive analytics dashboard...');

    const [
      salesSummary,
      customerInsights,
      productInsights,
      timeSeriesInsights,
      predictiveInsights
    ] = await Promise.all([
      this.getSalesSummary(),
      this.getCustomerInsights(),
      this.getProductInsights(),
      this.getTimeSeriesInsights(),
      this.getPredictiveInsights()
    ]);

    return {
      dashboard: {
        salesSummary,
        customerInsights,
        productInsights,
        timeSeriesInsights,
        predictiveInsights
      },
      metadata: {
        generatedAt: new Date(),
        dataFreshness: '< 1 hour',
        recordsCovered: {
          orders: salesSummary.totalOrders || 0,
          customers: customerInsights.totalCustomers || 0,
          products: productInsights.totalProducts || 0
        }
      }
    };
  }

  async getSalesSummary() {
    const pipeline = [
      {
        $match: {
          status: 'completed',
          orderDate: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
        }
      },
      {
        $group: {
          _id: null,
          totalOrders: { $sum: 1 },
          totalRevenue: { $sum: '$totals.total' },
          avgOrderValue: { $avg: '$totals.total' },
          uniqueCustomers: { $addToSet: '$customerId' }
        }
      },
      {
        $project: {
          totalOrders: 1,
          totalRevenue: { $round: ['$totalRevenue', 2] },
          avgOrderValue: { $round: ['$avgOrderValue', 2] },
          uniqueCustomers: { $size: '$uniqueCustomers' }
        }
      }
    ];

    const result = await this.collections.orders.aggregate(pipeline).toArray();
    return result[0] || {};
  }

  async getCustomerInsights() {
    const pipeline = [
      {
        $group: {
          _id: null,
          totalCustomers: { $sum: 1 },
          activeCustomers: { $sum: { $cond: [{ $eq: ['$account.status', 'active'] }, 1, 0] } },
          premiumCustomers: { $sum: { $cond: [{ $eq: ['$account.type', 'premium'] }, 1, 0] } }
        }
      }
    ];

    const result = await this.collections.customers.aggregate(pipeline).toArray();
    return result[0] || {};
  }

  async getProductInsights() {
    const pipeline = [
      {
        $group: {
          _id: null,
          totalProducts: { $sum: 1 },
          activeProducts: { $sum: { $cond: [{ $eq: ['$status', 'active'] }, 1, 0] } },
          avgPrice: { $avg: '$pricing.retail' }
        }
      },
      {
        $project: {
          totalProducts: 1,
          activeProducts: 1,
          avgPrice: { $round: ['$avgPrice', 2] }
        }
      }
    ];

    const result = await this.collections.products.aggregate(pipeline).toArray();
    return result[0] || {};
  }

  async getTimeSeriesInsights() {
    // Placeholder summary; a production implementation would read the cached
    // time-series results written by cacheAnalyticsResults() instead of static values
    return {
      trend: 'upward',
      growthRate: 12.5,
      volatility: 'moderate'
    };
  }

  async getPredictiveInsights() {
    // Placeholder summary; a production implementation would derive these from the
    // cached predictive analytics results rather than static values
    return {
      averageChurnRisk: 25.3,
      highValueCustomers: 150,
      upsellOpportunities: 320
    };
  }
}

// Benefits of MongoDB Advanced Aggregation Framework:
// - Real-time analytics processing without ETL pipelines or data warehouses
// - Complex multi-stage transformations with window functions and statistical operations
// - Advanced time-series analysis with forecasting and trend detection capabilities
// - Machine learning integration for predictive analytics and customer segmentation
// - Flexible aggregation patterns that adapt to changing analytical requirements
// - High-performance processing that scales with data volume and complexity
// - SQL-compatible analytical operations through QueryLeaf integration
// - Comprehensive business intelligence capabilities within the operational database
// - Advanced statistical functions and mathematical operations for data science
// - Real-time dashboard generation with automated insights and recommendations

module.exports = {
  MongoDBAnalyticsEngine
};
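
A minimal runner sketch for the engine above (assumptions: the class is saved as mongodb-analytics-engine.js and its constructor accepts a connected database handle; the real constructor in the full listing may differ):

// analytics-runner.js - hypothetical usage of MongoDBAnalyticsEngine
const { MongoClient } = require('mongodb');
const { MongoDBAnalyticsEngine } = require('./mongodb-analytics-engine');

async function runAnalytics() {
  const client = new MongoClient(process.env.MONGODB_URI || 'mongodb://localhost:27017');

  try {
    await client.connect();

    // Assumed constructor signature: a database handle wired to the collections used above
    const engine = new MongoDBAnalyticsEngine(client.db('ecommerce'));

    // Run the heavier analyses in parallel and cache their results for dashboards
    const [timeSeries, predictions] = await Promise.all([
      engine.executeTimeSeriesAnalytics(),
      engine.executePredictiveAnalytics()
    ]);

    await engine.cacheAnalyticsResults('time_series', timeSeries);
    await engine.cacheAnalyticsResults('predictive', predictions);

    // Lightweight summary for the UI layer
    const dashboard = await engine.getAnalyticsDashboard();
    console.log(JSON.stringify(dashboard.metadata, null, 2));
  } finally {
    await client.close();
  }
}

runAnalytics().catch(error => {
  console.error('Analytics run failed:', error);
  process.exit(1);
});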

SQL-Style Aggregation with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB aggregation operations:

-- QueryLeaf advanced analytics with SQL-familiar aggregation syntax

-- Complex sales analytics with window functions and advanced aggregations
WITH monthly_sales_analysis AS (
  SELECT 
    DATE_TRUNC('month', order_date) as month,
    product_category,
    customer_location.country,
    customer_type,

    -- Basic aggregations
    COUNT(*) as order_count,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(total_amount) as total_revenue,
    AVG(total_amount) as avg_order_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) as median_order_value,
    STDDEV_POP(total_amount) as order_value_stddev,

    -- Item-level aggregations
    SUM(item_quantity) as total_items_sold,
    AVG(item_quantity) as avg_items_per_order,
    SUM(item_quantity * item_unit_price) as item_revenue,
    AVG(item_unit_price) as avg_item_price,

    -- Advanced calculations
    SUM(item_quantity * (item_unit_price - product_cost)) as total_profit,
    AVG((item_unit_price - product_cost) / item_unit_price) as avg_profit_margin,

    -- Customer behavior metrics
    COUNT(*) FILTER (WHERE customer_registration_date >= DATE_TRUNC('month', order_date)) as new_customers,
    COUNT(DISTINCT customer_id) FILTER (WHERE previous_order_date < DATE_TRUNC('month', order_date) - INTERVAL '3 months') as returning_customers,

    -- Geographic diversity
    COUNT(DISTINCT customer_location.country) as unique_countries,
    COUNT(DISTINCT customer_location.region) as unique_regions

  FROM orders o
  CROSS JOIN UNNEST(o.items) as item
  JOIN products p ON item.product_id = p._id
  JOIN customers c ON o.customer_id = c._id
  WHERE o.status = 'completed'
    AND o.order_date >= CURRENT_DATE - INTERVAL '24 months'
  GROUP BY 
    DATE_TRUNC('month', order_date),
    product_category,
    customer_location.country,
    customer_type
),

-- Advanced window functions for trend analysis
sales_with_trends AS (
  SELECT 
    *,

    -- Moving averages
    AVG(total_revenue) OVER (
      PARTITION BY product_category, customer_location.country
      ORDER BY month
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as revenue_3month_ma,

    AVG(total_revenue) OVER (
      PARTITION BY product_category, customer_location.country
      ORDER BY month  
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as revenue_6month_ma,

    -- Growth calculations
    LAG(total_revenue, 1) OVER (
      PARTITION BY product_category, customer_location.country
      ORDER BY month
    ) as prev_month_revenue,

    LAG(total_revenue, 12) OVER (
      PARTITION BY product_category, customer_location.country
      ORDER BY month
    ) as prev_year_revenue,

    -- Ranking and percentiles
    RANK() OVER (
      PARTITION BY month
      ORDER BY total_revenue DESC
    ) as monthly_revenue_rank,

    PERCENT_RANK() OVER (
      PARTITION BY month
      ORDER BY total_revenue
    ) as monthly_revenue_percentile,

    -- Cumulative calculations
    SUM(total_revenue) OVER (
      PARTITION BY product_category, customer_location.country
      ORDER BY month
      ROWS UNBOUNDED PRECEDING
    ) as cumulative_revenue,

    -- Volatility measures
    STDDEV(total_revenue) OVER (
      PARTITION BY product_category, customer_location.country
      ORDER BY month
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as revenue_volatility,

    -- Lead/lag for forecasting
    LEAD(total_revenue, 1) OVER (
      PARTITION BY product_category, customer_location.country
      ORDER BY month
    ) as next_month_actual,

    -- Dense rank for market position
    DENSE_RANK() OVER (
      PARTITION BY month
      ORDER BY total_revenue DESC
    ) as market_position

  FROM monthly_sales_analysis
),

-- Calculate growth rates and performance indicators
performance_metrics AS (
  SELECT 
    *,

    -- Growth rate calculations
    CASE 
      WHEN prev_month_revenue > 0 THEN 
        ROUND(((total_revenue - prev_month_revenue) / prev_month_revenue * 100), 2)
      ELSE NULL
    END as month_over_month_growth,

    CASE 
      WHEN prev_year_revenue > 0 THEN
        ROUND(((total_revenue - prev_year_revenue) / prev_year_revenue * 100), 2)
      ELSE NULL
    END as year_over_year_growth,

    -- Trend classification
    CASE 
      WHEN revenue_3month_ma > revenue_6month_ma * 1.05 THEN 'strong_growth'
      WHEN revenue_3month_ma > revenue_6month_ma * 1.02 THEN 'moderate_growth'
      WHEN revenue_3month_ma < revenue_6month_ma * 0.95 THEN 'declining'
      WHEN revenue_3month_ma < revenue_6month_ma * 0.98 THEN 'weak_growth'
      ELSE 'stable'
    END as trend_classification,

    -- Performance assessment
    CASE 
      WHEN monthly_revenue_percentile >= 0.9 THEN 'top_performer'
      WHEN monthly_revenue_percentile >= 0.75 THEN 'strong_performer'
      WHEN monthly_revenue_percentile >= 0.5 THEN 'average_performer'
      WHEN monthly_revenue_percentile >= 0.25 THEN 'weak_performer'
      ELSE 'bottom_performer'
    END as performance_category,

    -- Volatility assessment
    CASE 
      WHEN revenue_volatility / NULLIF(revenue_6month_ma, 0) > 0.3 THEN 'high_volatility'
      WHEN revenue_volatility / NULLIF(revenue_6month_ma, 0) > 0.15 THEN 'moderate_volatility'
      ELSE 'low_volatility'
    END as volatility_level,

    -- Market share approximation
    ROUND(
      (total_revenue / SUM(total_revenue) OVER (PARTITION BY month) * 100), 
      3
    ) as market_share_percent,

    -- Customer metrics
    ROUND((total_revenue / unique_customers), 2) as revenue_per_customer,
    ROUND((total_profit / total_revenue * 100), 2) as profit_margin_percent,
    ROUND((new_customers / unique_customers * 100), 2) as new_customer_rate,

    -- Operational efficiency
    ROUND((total_items_sold / order_count), 2) as items_per_order,
    ROUND((total_revenue / total_items_sold), 2) as revenue_per_item

  FROM sales_with_trends
),

-- Advanced customer segmentation with RFM analysis
customer_rfm_analysis AS (
  SELECT 
    customer_id,
    customer_type,
    customer_location.country,
    customer_registration_date,

    -- Recency calculation (days since last order)
    EXTRACT(DAYS FROM (CURRENT_DATE - MAX(order_date))) as recency_days,

    -- Frequency (number of orders)
    COUNT(*) as frequency,

    -- Monetary (total spending)
    SUM(total_amount) as monetary_value,

    -- Additional behavioral metrics
    AVG(total_amount) as avg_order_value,
    MIN(order_date) as first_order_date,
    MAX(order_date) as last_order_date,
    COUNT(DISTINCT product_category) as unique_categories_purchased,
    EXTRACT(DAYS FROM (MAX(order_date) - MIN(order_date))) as customer_lifetime_days,

    -- Purchase patterns
    AVG(EXTRACT(DAYS FROM (order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date)))) as avg_days_between_orders,
    STDDEV(total_amount) as order_value_consistency,

    -- Seasonal analysis
    COUNT(*) FILTER (WHERE EXTRACT(QUARTER FROM order_date) = 1) as q1_orders,
    COUNT(*) FILTER (WHERE EXTRACT(QUARTER FROM order_date) = 2) as q2_orders,
    COUNT(*) FILTER (WHERE EXTRACT(QUARTER FROM order_date) = 3) as q3_orders,
    COUNT(*) FILTER (WHERE EXTRACT(QUARTER FROM order_date) = 4) as q4_orders

  FROM orders o
  JOIN customers c ON o.customer_id = c._id
  WHERE o.status = 'completed'
    AND o.order_date >= CURRENT_DATE - INTERVAL '24 months'
  GROUP BY customer_id, customer_type, customer_location.country, customer_registration_date
),

-- Calculate RFM scores and customer segments
customer_segments AS (
  SELECT 
    *,

    -- RFM score calculations using percentile ranking
    NTILE(5) OVER (ORDER BY recency_days ASC) as recency_score, -- Lower recency is better
    NTILE(5) OVER (ORDER BY frequency DESC) as frequency_score, -- Higher frequency is better  
    NTILE(5) OVER (ORDER BY monetary_value DESC) as monetary_score, -- Higher monetary is better

    -- Customer lifetime value prediction
    CASE 
      WHEN avg_days_between_orders > 0 THEN
        ROUND(
          (avg_order_value * (365.0 / avg_days_between_orders) * 3), -- 3 year projection
          2
        )
      ELSE monetary_value
    END as predicted_lifetime_value,

    -- Churn risk assessment
    CASE 
      WHEN recency_days > 180 THEN 'high_risk'
      WHEN recency_days > 90 THEN 'medium_risk'
      WHEN recency_days > 30 THEN 'low_risk'
      ELSE 'active'
    END as churn_risk,

    -- Engagement level
    CASE 
      WHEN frequency >= 10 AND recency_days <= 30 THEN 'highly_engaged'
      WHEN frequency >= 5 AND recency_days <= 60 THEN 'moderately_engaged'
      WHEN frequency >= 2 AND recency_days <= 120 THEN 'lightly_engaged'
      ELSE 'disengaged'
    END as engagement_level,

    -- Purchase diversity
    CASE 
      WHEN unique_categories_purchased >= 5 THEN 'diverse_buyer'
      WHEN unique_categories_purchased >= 3 THEN 'selective_buyer'
      ELSE 'focused_buyer'
    END as purchase_diversity

  FROM customer_rfm_analysis
),

-- Final customer classification
customer_classification AS (
  SELECT 
    *,

    -- RFM segment classification
    CASE 
      WHEN recency_score >= 4 AND frequency_score >= 4 AND monetary_score >= 4 THEN 'champions'
      WHEN recency_score >= 2 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'loyal_customers'
      WHEN recency_score >= 3 AND frequency_score <= 3 AND monetary_score <= 3 THEN 'potential_loyalists'
      WHEN recency_score >= 4 AND frequency_score <= 1 THEN 'new_customers'
      WHEN recency_score >= 3 AND frequency_score <= 1 AND monetary_score <= 2 THEN 'promising'
      WHEN recency_score <= 2 AND frequency_score >= 2 AND monetary_score >= 2 THEN 'need_attention'
      WHEN recency_score <= 2 AND frequency_score <= 2 AND monetary_score >= 3 THEN 'about_to_sleep'
      WHEN recency_score <= 2 AND frequency_score <= 2 AND monetary_score <= 2 THEN 'at_risk'
      WHEN recency_score <= 1 AND frequency_score <= 2 AND monetary_score >= 4 THEN 'cannot_lose_them'
      ELSE 'hibernating'
    END as rfm_segment,

    -- Marketing action recommendations
    CASE 
      WHEN recency_score >= 4 AND frequency_score >= 4 THEN 'Reward with loyalty program and exclusive offers'
      WHEN monetary_score >= 4 AND recency_score <= 2 THEN 'Win-back campaign with premium incentives'
      WHEN recency_score >= 4 AND frequency_score <= 2 THEN 'Nurture with educational content and onboarding'
      WHEN frequency_score >= 3 AND recency_days > 60 THEN 'Re-engagement campaign with personalized offers'
      WHEN churn_risk = 'high_risk' AND monetary_score >= 3 THEN 'Urgent retention campaign'
      ELSE 'Monitor and maintain regular communication'
    END as marketing_recommendation

  FROM customer_segments
),

-- Product performance analysis with advanced metrics
product_performance AS (
  SELECT 
    p._id as product_id,
    p.name as product_name,
    p.category,
    p.brand,
    p.pricing.cost,
    p.pricing.retail,

    -- Sales metrics from orders
    COALESCE(sales.total_units_sold, 0) as total_units_sold,
    COALESCE(sales.total_revenue, 0) as total_revenue,
    COALESCE(sales.total_orders, 0) as total_orders,
    COALESCE(sales.unique_customers, 0) as unique_customers,
    COALESCE(sales.avg_selling_price, p.pricing.retail) as avg_selling_price,

    -- Profitability analysis
    COALESCE(sales.total_profit, 0) as total_profit,
    CASE 
      WHEN COALESCE(sales.total_revenue, 0) > 0 THEN
        ROUND((COALESCE(sales.total_profit, 0) / sales.total_revenue * 100), 2)
      ELSE 0
    END as profit_margin_percent,

    -- Performance indicators
    CASE 
      WHEN COALESCE(sales.total_revenue, 0) = 0 THEN 'no_sales'
      WHEN sales.first_sale_date >= CURRENT_DATE - INTERVAL '90 days' THEN 'new_product'
      WHEN sales.total_revenue >= 50000 AND sales.total_profit / sales.total_revenue >= 0.2 THEN 'star'
      WHEN sales.total_revenue >= 10000 AND sales.total_profit / sales.total_revenue >= 0.15 THEN 'promising'
      WHEN sales.last_sale_date < CURRENT_DATE - INTERVAL '60 days' THEN 'declining'
      ELSE 'stable'
    END as performance_category,

    -- Inventory analysis
    p.inventory.quantity as current_stock,
    CASE 
      WHEN COALESCE(sales.total_units_sold, 0) > 0 AND p.inventory.quantity > 0 THEN
        ROUND((sales.total_units_sold / p.inventory.quantity), 2)
      ELSE 0
    END as inventory_turnover,

    -- Market position
    sales.category_rank,
    sales.category_market_share,

    -- Time-based metrics
    sales.first_sale_date,
    sales.last_sale_date,
    sales.sales_trend

  FROM products p
  LEFT JOIN (
    SELECT 
      item.product_id,
      COUNT(DISTINCT o.order_id) as total_orders,
      SUM(item.quantity) as total_units_sold,
      SUM(item.quantity * item.unit_price) as total_revenue,
      COUNT(DISTINCT o.customer_id) as unique_customers,
      AVG(item.unit_price) as avg_selling_price,
      MIN(o.order_date) as first_sale_date,
      MAX(o.order_date) as last_sale_date,
      SUM(item.quantity * (item.unit_price - p.pricing.cost)) as total_profit,

      -- Category ranking
      RANK() OVER (PARTITION BY p.category ORDER BY SUM(item.quantity * item.unit_price) DESC) as category_rank,

      -- Market share within category
      ROUND(
        (SUM(item.quantity * item.unit_price) / 
         SUM(SUM(item.quantity * item.unit_price)) OVER (PARTITION BY p.category) * 100),
        2
      ) as category_market_share,

      -- Sales trend analysis
      CASE 
        WHEN COUNT(*) FILTER (WHERE o.order_date >= CURRENT_DATE - INTERVAL '90 days') >
             COUNT(*) FILTER (WHERE o.order_date BETWEEN CURRENT_DATE - INTERVAL '180 days' AND CURRENT_DATE - INTERVAL '90 days') 
        THEN 'growing'
        WHEN COUNT(*) FILTER (WHERE o.order_date >= CURRENT_DATE - INTERVAL '90 days') <
             COUNT(*) FILTER (WHERE o.order_date BETWEEN CURRENT_DATE - INTERVAL '180 days' AND CURRENT_DATE - INTERVAL '90 days') * 0.8
        THEN 'declining'  
        ELSE 'stable'
      END as sales_trend

    FROM orders o
    CROSS JOIN UNNEST(o.items) as item
    JOIN products p ON item.product_id = p._id
    WHERE o.status = 'completed'
      AND o.order_date >= CURRENT_DATE - INTERVAL '12 months'
    GROUP BY item.product_id, p.category, p.pricing.cost
  ) sales ON p._id = sales.product_id
)

-- Final consolidated analytics report
SELECT 
  'EXECUTIVE_SUMMARY' as report_section,

  -- Overall business performance
  (SELECT COUNT(*) FROM performance_metrics WHERE month >= CURRENT_DATE - INTERVAL '1 month') as current_month_segments,
  (SELECT ROUND(AVG(total_revenue), 2) FROM performance_metrics WHERE month >= CURRENT_DATE - INTERVAL '1 month') as avg_monthly_revenue,
  (SELECT ROUND(AVG(month_over_month_growth), 2) FROM performance_metrics WHERE month_over_month_growth IS NOT NULL) as avg_growth_rate,

  -- Customer insights
  (SELECT COUNT(*) FROM customer_classification WHERE rfm_segment = 'champions') as champion_customers,
  (SELECT COUNT(*) FROM customer_classification WHERE churn_risk = 'high_risk') as high_risk_customers,
  (SELECT ROUND(AVG(predicted_lifetime_value), 2) FROM customer_classification) as avg_customer_lifetime_value,

  -- Product insights
  (SELECT COUNT(*) FROM product_performance WHERE performance_category = 'star') as star_products,
  (SELECT COUNT(*) FROM product_performance WHERE performance_category = 'declining') as declining_products,
  (SELECT ROUND(AVG(profit_margin_percent), 2) FROM product_performance WHERE total_revenue > 0) as avg_profit_margin,

  -- Strategic recommendations
  CASE 
    WHEN (SELECT AVG(month_over_month_growth) FROM performance_metrics WHERE month_over_month_growth IS NOT NULL) < -10 
    THEN 'URGENT: Implement revenue recovery strategy'
    WHEN (SELECT COUNT(*) FROM customer_classification WHERE churn_risk = 'high_risk') > 
         (SELECT COUNT(*) FROM customer_classification WHERE rfm_segment = 'champions')
    THEN 'FOCUS: Customer retention and re-engagement programs'
    WHEN (SELECT COUNT(*) FROM product_performance WHERE performance_category = 'star') < 5
    THEN 'OPPORTUNITY: Invest in product development and innovation'
    ELSE 'MAINTAIN: Continue current strategies with incremental improvements'
  END as primary_strategic_recommendation

UNION ALL

-- Performance trends
SELECT 
  'PERFORMANCE_TRENDS',
  month::text,
  product_category,
  customer_location.country,
  total_revenue,
  month_over_month_growth,
  trend_classification,
  performance_category,
  market_share_percent
FROM performance_metrics
WHERE month >= CURRENT_DATE - INTERVAL '6 months'
ORDER BY month DESC, total_revenue DESC
LIMIT 20

UNION ALL

-- Top customer segments  
SELECT 
  'CUSTOMER_SEGMENTS',
  rfm_segment,
  COUNT(*)::text as customer_count,
  ROUND(AVG(monetary_value), 2)::text as avg_lifetime_value,
  churn_risk,
  engagement_level,
  marketing_recommendation
FROM customer_classification  
GROUP BY rfm_segment, churn_risk, engagement_level, marketing_recommendation
ORDER BY AVG(monetary_value) DESC
LIMIT 15

UNION ALL

-- Product performance summary
SELECT 
  'PRODUCT_PERFORMANCE',
  product_name,
  category,
  total_revenue::text,
  profit_margin_percent::text,
  performance_category,
  inventory_turnover::text,
  sales_trend,
  CASE category_rank WHEN 1 THEN 'Category Leader' ELSE category_rank::text END
FROM product_performance
WHERE total_revenue > 0
ORDER BY total_revenue DESC
LIMIT 25;

-- QueryLeaf provides comprehensive aggregation capabilities:
-- 1. SQL-familiar window functions with OVER clauses and frame specifications  
-- 2. Advanced statistical functions including percentiles, standard deviation, and ranking
-- 3. Complex GROUP BY operations with ROLLUP, CUBE, and GROUPING SETS support
-- 4. Sophisticated JOIN operations including LATERAL joins for nested processing
-- 5. CTEs (Common Table Expressions) for complex multi-stage analytical queries  
-- 6. CASE expressions and conditional logic for business rule implementation
-- 7. Date/time functions for temporal analysis and time-series processing
-- 8. String and array functions for text processing and data transformation
-- 9. JSON processing functions for nested document analysis and extraction
-- 10. Integration with MongoDB's native aggregation optimizations and indexing

Best Practices for MongoDB Aggregation Implementation

Pipeline Design Principles

Essential guidelines for effective aggregation pipeline construction (a short sketch follows the list):

  1. Early Filtering: Place $match stages early to reduce data volume through the pipeline
  2. Index Utilization: Design pipelines to leverage existing indexes for optimal performance
  3. Stage Ordering: Order stages to minimize computational overhead and data transfer
  4. Memory Management: Monitor memory usage and use allowDiskUse for large datasets
  5. Field Projection: Use $project to limit fields and reduce document size early
  6. Pipeline Caching: Cache frequently-used aggregation results for improved performance
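
Applied to the order analytics used throughout this post, those principles look roughly like the following sketch (field and collection names mirror the earlier examples; db is assumed to be a connected database handle):

// Sketch: stage ordering and memory settings for a daily revenue rollup
async function dailyRevenue(db) {
  const pipeline = [
    // 1. Filter first so later stages only see relevant documents
    { $match: { status: 'completed', orderDate: { $gte: new Date('2024-01-01') } } },

    // 2. Project early to shrink documents before the expensive $group
    { $project: { orderDate: 1, 'totals.total': 1 } },

    // 3. Group once the working set has been reduced
    {
      $group: {
        _id: { $dateTrunc: { date: '$orderDate', unit: 'day' } },
        revenue: { $sum: '$totals.total' },
        orders: { $sum: 1 }
      }
    },

    { $sort: { _id: 1 } }
  ];

  // allowDiskUse lets large groupings spill to disk instead of hitting memory limits
  return db.collection('orders').aggregate(pipeline, { allowDiskUse: true }).toArray();
}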

Performance Optimization Strategies

Optimize MongoDB aggregation pipelines for production workloads (see the indexing sketch after this list):

  1. Compound Indexes: Create indexes that support multiple pipeline stages
  2. Covered Queries: Design pipelines that can be satisfied entirely from indexes
  3. Parallel Processing: Use multiple concurrent pipelines for independent analyses
  4. Result Caching: Implement intelligent caching for expensive aggregations
  5. Incremental Updates: Process only new/changed data for time-series analytics
  6. Resource Monitoring: Track aggregation performance and optimize accordingly
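
For example, the $match filters used by the analytics pipelines in this post can be backed by a compound index, and index usage can be verified from explain output (a sketch; db is assumed to be a connected database handle):

// Sketch: compound index supporting the status + orderDate filters, verified via explain
async function optimizeOrderAnalytics(db) {
  const orders = db.collection('orders');

  // Index matching the equality (status) then range (orderDate) filter pattern
  await orders.createIndex({ status: 1, orderDate: -1 }, { name: 'status_1_orderDate_-1' });

  // Confirm the pipeline's $match is satisfied by an index scan (IXSCAN) rather than a COLLSCAN
  const explanation = await orders
    .aggregate(
      [
        { $match: { status: 'completed', orderDate: { $gte: new Date('2024-01-01') } } },
        { $group: { _id: null, revenue: { $sum: '$totals.total' } } }
      ],
      { allowDiskUse: true }
    )
    .explain('executionStats');

  console.log(JSON.stringify(explanation, null, 2));
}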

Conclusion

MongoDB's Aggregation Framework provides comprehensive real-time analytics capabilities that eliminate the need for separate ETL processes, data warehouses, and batch processing systems. The powerful pipeline architecture enables sophisticated data transformations, statistical analysis, and predictive modeling directly within the operational database, delivering immediate insights while maintaining high performance at scale.

Key MongoDB Aggregation Framework benefits include:

  • Real-Time Processing: Immediate analytical results without data movement or batch delays
  • Advanced Analytics: Comprehensive statistical functions, window operations, and machine learning integration
  • Flexible Pipelines: Multi-stage transformations that adapt to evolving analytical requirements
  • Scalable Performance: High-performance processing that scales with data volume and complexity
  • SQL Compatibility: Familiar analytical operations through QueryLeaf's SQL interface
  • Integrated Architecture: Seamless integration with operational workloads and existing applications

Whether you're building real-time dashboards, customer analytics platforms, financial reporting systems, or any application requiring sophisticated data analysis, MongoDB's Aggregation Framework with QueryLeaf's familiar SQL interface provides the foundation for powerful, maintainable analytical solutions.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB aggregation operations while providing SQL-familiar analytics syntax, window functions, and statistical operations. Complex analytical queries, predictive models, and real-time insights are seamlessly handled through familiar SQL constructs, making advanced data processing both powerful and accessible to SQL-oriented development teams.

The combination of MongoDB's native aggregation capabilities with SQL-style analytics operations makes MongoDB an ideal platform for applications requiring both operational efficiency and analytical sophistication, ensuring your applications can deliver immediate insights while maintaining optimal performance as data volumes and complexity grow.

MongoDB Connection Pooling Optimization Strategies: Advanced Connection Management and Performance Tuning for High-Throughput Applications

High-throughput database applications require sophisticated connection management strategies and comprehensive pooling optimization techniques that can handle concurrent request patterns, varying workload demands, and complex scaling requirements while maintaining optimal performance and resource utilization. Traditional database connection approaches often struggle with dynamic scaling, connection overhead management, and the complexity of balancing connection availability with resource consumption, leading to performance bottlenecks, resource exhaustion, and operational challenges in production environments.

MongoDB provides comprehensive connection pooling capabilities through intelligent connection management, sophisticated monitoring features, and optimized driver implementations that enable applications to achieve maximum throughput with minimal connection overhead. Unlike traditional databases that require manual connection tuning procedures and complex pooling configuration, MongoDB drivers integrate advanced pooling algorithms directly with automatic connection scaling, real-time performance monitoring, and intelligent connection lifecycle management.
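
In the Node.js driver, most of this behavior is configured through client options and observed through Connection Monitoring and Pooling (CMAP) events; a minimal sketch with illustrative values:

// Sketch: pool sizing and CMAP event monitoring with the Node.js MongoDB driver
const { MongoClient } = require('mongodb');

const client = new MongoClient(process.env.MONGODB_URI || 'mongodb://localhost:27017', {
  maxPoolSize: 100,         // upper bound on concurrent connections per server
  minPoolSize: 10,          // keep warm connections available for traffic bursts
  maxIdleTimeMS: 60000,     // close connections idle for more than 60 seconds
  waitQueueTimeoutMS: 5000  // fail fast when a checkout waits longer than 5 seconds
});

// CMAP events expose checkout latency and failure patterns for monitoring
client.on('connectionCheckedOut', event => {
  console.debug(`Connection checked out from ${event.address}`);
});

client.on('connectionCheckOutFailed', event => {
  console.warn(`Connection checkout failed: ${event.reason}`);
});

async function main() {
  await client.connect();
  // ... application workload ...
  await client.close();
}

main().catch(console.error);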

The Traditional Connection Management Challenge

Conventional approaches to database connection management in enterprise applications face significant limitations in scalability and resource optimization:

-- Traditional PostgreSQL connection management - manual pooling with limited optimization capabilities

-- Basic connection tracking table with minimal functionality
CREATE TABLE connection_pool_stats (
    pool_id SERIAL PRIMARY KEY,
    application_name VARCHAR(100) NOT NULL,
    database_name VARCHAR(100) NOT NULL,
    host_address VARCHAR(255) NOT NULL,
    port_number INTEGER DEFAULT 5432,

    -- Basic pool configuration (static settings)
    min_connections INTEGER DEFAULT 5,
    max_connections INTEGER DEFAULT 20,
    connection_timeout INTEGER DEFAULT 30,
    idle_timeout INTEGER DEFAULT 600,

    -- Simple usage statistics (very limited visibility)
    active_connections INTEGER DEFAULT 0,
    idle_connections INTEGER DEFAULT 0,
    total_connections_created BIGINT DEFAULT 0,
    total_connections_destroyed BIGINT DEFAULT 0,

    -- Basic performance metrics
    avg_connection_wait_time DECIMAL(8,3),
    max_connection_wait_time DECIMAL(8,3),
    connection_failures BIGINT DEFAULT 0,

    -- Manual tracking timestamps
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_analyzed TIMESTAMP
);

-- Query execution tracking table (basic functionality)
CREATE TABLE query_execution_log (
    execution_id SERIAL PRIMARY KEY,
    pool_id INTEGER REFERENCES connection_pool_stats(pool_id),
    session_id VARCHAR(100),

    -- Query identification
    query_hash VARCHAR(64),
    query_type VARCHAR(50), -- SELECT, INSERT, UPDATE, DELETE
    query_text TEXT, -- Usually truncated for storage

    -- Basic timing information
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP,
    execution_duration DECIMAL(10,3),

    -- Connection usage
    connection_acquired_at TIMESTAMP,
    connection_released_at TIMESTAMP,
    connection_wait_time DECIMAL(8,3),

    -- Simple result metrics
    rows_affected INTEGER,
    rows_returned INTEGER,

    -- Status tracking
    execution_status VARCHAR(20), -- success, error, timeout
    error_message TEXT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Manual connection pool configuration (static and inflexible)
CREATE TABLE pool_configuration (
    config_id SERIAL PRIMARY KEY,
    pool_name VARCHAR(100) UNIQUE NOT NULL,

    -- Static configuration parameters
    initial_pool_size INTEGER DEFAULT 5,
    maximum_pool_size INTEGER DEFAULT 50,
    minimum_idle_connections INTEGER DEFAULT 2,

    -- Timeout settings (fixed values)
    connection_timeout_seconds INTEGER DEFAULT 30,
    idle_connection_timeout_seconds INTEGER DEFAULT 1800,
    validation_timeout_seconds INTEGER DEFAULT 5,

    -- Simple retry configuration
    max_retry_attempts INTEGER DEFAULT 3,
    retry_delay_seconds INTEGER DEFAULT 1,

    -- Basic health check
    validation_query VARCHAR(500) DEFAULT 'SELECT 1',
    validate_on_borrow BOOLEAN DEFAULT true,
    validate_on_return BOOLEAN DEFAULT false,

    -- Manual maintenance
    test_while_idle BOOLEAN DEFAULT true,
    time_between_eviction_runs INTEGER DEFAULT 300,
    num_tests_per_eviction_run INTEGER DEFAULT 3,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Complex query to analyze connection performance (expensive and limited)
WITH connection_performance AS (
    SELECT 
        cps.application_name,
        cps.database_name,
        cps.host_address,

        -- Basic pool utilization
        CASE 
            WHEN cps.max_connections > 0 THEN 
                (cps.active_connections::DECIMAL / cps.max_connections) * 100
            ELSE 0 
        END as pool_utilization_percent,

        -- Simple connection metrics
        cps.active_connections,
        cps.idle_connections,
        cps.total_connections_created,
        cps.total_connections_destroyed,

        -- Basic performance statistics
        cps.avg_connection_wait_time,
        cps.max_connection_wait_time,
        cps.connection_failures,

        -- Query performance (limited aggregation)
        COUNT(qel.execution_id) as total_queries_24h,
        AVG(qel.execution_duration) as avg_query_duration,
        AVG(qel.connection_wait_time) as avg_connection_wait,

        -- Simple error tracking
        COUNT(CASE WHEN qel.execution_status = 'error' THEN 1 END) as error_count,
        COUNT(CASE WHEN qel.execution_status = 'timeout' THEN 1 END) as timeout_count,

        -- Basic connection efficiency (limited insights)
        CASE 
            WHEN COUNT(qel.execution_id) > 0 THEN
                COUNT(CASE WHEN qel.connection_wait_time < 0.100 THEN 1 END)::DECIMAL / 
                COUNT(qel.execution_id) * 100
            ELSE 0
        END as fast_connection_percentage

    FROM connection_pool_stats cps
    LEFT JOIN query_execution_log qel ON cps.pool_id = qel.pool_id
        AND qel.start_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'

    WHERE cps.last_updated >= CURRENT_TIMESTAMP - INTERVAL '1 hour'

    GROUP BY cps.pool_id, cps.application_name, cps.database_name, 
             cps.host_address, cps.active_connections, cps.idle_connections,
             cps.max_connections, cps.total_connections_created, 
             cps.total_connections_destroyed, cps.avg_connection_wait_time,
             cps.max_connection_wait_time, cps.connection_failures
),

pool_health_analysis AS (
    SELECT *,
        -- Simple health scoring (limited factors)
        CASE 
            WHEN pool_utilization_percent > 90 THEN 'Critical'
            WHEN pool_utilization_percent > 75 THEN 'Warning'
            WHEN avg_connection_wait > 1.0 THEN 'Warning'
            WHEN error_count > total_queries_24h * 0.05 THEN 'Warning'
            ELSE 'Healthy'
        END as pool_health_status,

        -- Basic recommendation logic (very limited)
        CASE 
            WHEN pool_utilization_percent > 85 THEN 
                'Consider increasing max_connections'
            WHEN avg_connection_wait > 0.5 THEN 
                'Review connection timeout settings'
            WHEN error_count > 10 THEN 
                'Investigate connection failures'
            ELSE 'Pool configuration appears adequate'
        END as basic_recommendation

    FROM connection_performance
)

SELECT 
    application_name,
    database_name,
    host_address,

    -- Pool status overview
    pool_utilization_percent,
    pool_health_status,
    active_connections,
    idle_connections,

    -- Performance metrics
    total_queries_24h,
    avg_query_duration,
    avg_connection_wait,
    fast_connection_percentage,

    -- Error tracking
    error_count,
    timeout_count,

    -- Basic recommendations
    basic_recommendation,

    CURRENT_TIMESTAMP as analysis_timestamp

FROM pool_health_analysis
ORDER BY 
    CASE pool_health_status
        WHEN 'Critical' THEN 1
        WHEN 'Warning' THEN 2
        ELSE 3
    END,
    pool_utilization_percent DESC;

-- Problems with traditional connection pooling approach:
-- 1. Static configuration cannot adapt to changing workloads
-- 2. Limited visibility into connection lifecycle and performance
-- 3. Manual tuning required for optimal performance
-- 4. No automatic scaling based on demand patterns
-- 5. Basic health checking with limited diagnostic capabilities
-- 6. Inefficient connection distribution across database instances
-- 7. No built-in monitoring for connection pool performance
-- 8. Difficult to troubleshoot connection-related performance issues
-- 9. Limited integration with application performance monitoring
-- 10. Manual intervention required for pool optimization

MongoDB's intelligent connection pooling eliminates these limitations:

// MongoDB optimized connection pooling - intelligent and performance-focused
// Advanced connection management with automatic optimization

const { MongoClient } = require('mongodb');

// Comprehensive connection pool configuration
class MongoConnectionPoolManager {
  constructor(connectionUri, options = {}) {
    this.connectionUri = connectionUri;
    this.poolOptions = {
      // Intelligent pool sizing
      minPoolSize: options.minPoolSize || 5,
      maxPoolSize: options.maxPoolSize || 100,
      maxIdleTimeMS: options.maxIdleTimeMS || 30000,

      // Advanced connection management
      waitQueueTimeoutMS: options.waitQueueTimeoutMS || 2500,
      serverSelectionTimeoutMS: options.serverSelectionTimeoutMS || 5000,
      socketTimeoutMS: options.socketTimeoutMS || 45000,
      connectTimeoutMS: options.connectTimeoutMS || 10000,

      // Intelligent retry logic
      retryWrites: true,
      retryReads: true,
      maxStalenessSeconds: options.maxStalenessSeconds || 90,

      // Advanced monitoring capabilities
      monitorCommands: true,

      // Intelligent load balancing
      loadBalanced: options.loadBalanced || false,

      // Connection compression
      compressors: options.compressors || ['snappy', 'zlib'],

      // SSL/TLS optimization (?? so an explicit false is not silently overridden)
      ssl: options.ssl ?? true,
      sslValidate: options.sslValidate ?? true,

      // Advanced read preferences
      readPreference: options.readPreference || 'secondaryPreferred',
      readConcern: { level: options.readConcernLevel || 'majority' },

      // Write concern optimization
      writeConcern: {
        w: options.writeConcernW || 'majority',
        j: options.writeConcernJ ?? true,
        wtimeout: options.writeConcernTimeout || 5000
      }
    };

    this.client = null;
    this.connectionMetrics = new Map();

    // Correlation state for driver events that arrive as separate objects
    this.checkoutStartTimes = new Map(); // per-address FIFO of checkout start times
    this.commandStartTimes = new Map();  // keyed by command requestId

    // Handles for background timers so shutdown() can stop them
    this.monitoringInterval = null;
    this.optimizationInterval = null;
    this.performanceStats = {
      totalConnections: 0,
      activeConnections: 0,
      connectionWaitTimes: [],
      queryExecutionTimes: [],
      connectionErrors: 0,
      poolHealthScore: 100
    };
  }

  async initializeConnectionPool() {
    try {
      console.log('Initializing MongoDB connection pool with intelligent optimization...');

      // Create client with advanced pooling options
      this.client = new MongoClient(this.connectionUri, this.poolOptions);

      // Set up comprehensive event listeners for monitoring
      this.setupConnectionMonitoring();

      // Connect with retry logic and health checking
      await this.connectWithHealthCheck();

      // Initialize performance monitoring
      this.startPerformanceMonitoring();

      // Setup automatic pool optimization
      this.setupAutomaticOptimization();

      console.log('MongoDB connection pool initialized successfully');
      return this.client;

    } catch (error) {
      console.error('Failed to initialize MongoDB connection pool:', error);
      throw error;
    }
  }

  setupConnectionMonitoring() {
    // Connection pool monitoring events
    this.client.on('connectionPoolCreated', (event) => {
      console.log(`Connection pool created: ${event.address}`);
      this.logPoolEvent('pool_created', event);
    });

    this.client.on('connectionPoolReady', (event) => {
      console.log(`Connection pool ready: ${event.address}`);
      this.logPoolEvent('pool_ready', event);
    });

    this.client.on('connectionCreated', (event) => {
      this.performanceStats.totalConnections++;
      this.logPoolEvent('connection_created', event);
    });

    this.client.on('connectionReady', (event) => {
      this.performanceStats.activeConnections++;
      this.logPoolEvent('connection_ready', event);
    });

    this.client.on('connectionClosed', (event) => {
      this.performanceStats.activeConnections = Math.max(0, this.performanceStats.activeConnections - 1);
      this.logPoolEvent('connection_closed', event);
    });

    this.client.on('connectionCheckOutStarted', (event) => {
      // Checkout start and completion arrive as separate event objects, so start
      // times are tracked per server address in a FIFO queue for correlation
      const pending = this.checkoutStartTimes.get(event.address) || [];
      pending.push(Date.now());
      this.checkoutStartTimes.set(event.address, pending);
      this.logPoolEvent('checkout_started', event);
    });

    this.client.on('connectionCheckedOut', (event) => {
      const pending = this.checkoutStartTimes.get(event.address) || [];
      const startTime = pending.shift();
      const waitTime = startTime ? Date.now() - startTime : 0;
      this.performanceStats.connectionWaitTimes.push({ value: waitTime, timestamp: Date.now() });
      this.logPoolEvent('connection_checked_out', { ...event, waitTime });
    });

    this.client.on('connectionCheckedIn', (event) => {
      this.logPoolEvent('connection_checked_in', event);
    });

    // Command monitoring for performance analysis
    // (started/succeeded/failed events are distinct objects, correlated here by requestId)
    this.client.on('commandStarted', (event) => {
      this.commandStartTimes.set(event.requestId, Date.now());
      this.logCommandEvent('command_started', event);
    });

    this.client.on('commandSucceeded', (event) => {
      const startTime = this.commandStartTimes.get(event.requestId);
      this.commandStartTimes.delete(event.requestId);
      const executionTime = startTime ? Date.now() - startTime : 0;
      this.logCommandEvent('command_succeeded', { ...event, executionTime });
    });

    this.client.on('commandFailed', (event) => {
      this.performanceStats.connectionErrors++;
      const startTime = this.commandStartTimes.get(event.requestId);
      this.commandStartTimes.delete(event.requestId);
      const executionTime = startTime ? Date.now() - startTime : 0;
      this.logCommandEvent('command_failed', { ...event, executionTime });
    });

    // Server monitoring for intelligent scaling
    this.client.on('serverHeartbeatStarted', (event) => {
      this.logServerEvent('heartbeat_started', event);
    });

    this.client.on('serverHeartbeatSucceeded', (event) => {
      this.logServerEvent('heartbeat_succeeded', event);
    });

    this.client.on('serverHeartbeatFailed', (event) => {
      this.logServerEvent('heartbeat_failed', event);
    });
  }

  async connectWithHealthCheck() {
    const maxRetries = 3;
    let retryCount = 0;

    while (retryCount < maxRetries) {
      try {
        await this.client.connect();

        // Perform health check
        const healthCheck = await this.performHealthCheck();
        if (healthCheck.healthy) {
          console.log('Connection pool health check passed');
          return;
        } else {
          throw new Error(`Health check failed: ${healthCheck.issues.join(', ')}`);
        }

      } catch (error) {
        retryCount++;
        console.error(`Connection attempt ${retryCount} failed:`, error.message);

        if (retryCount >= maxRetries) {
          throw error;
        }

        // Exponential backoff
        await new Promise(resolve => setTimeout(resolve, Math.pow(2, retryCount) * 1000));
      }
    }
  }

  async performHealthCheck() {
    try {
      // Test basic connectivity
      const admin = this.client.db('admin');
      const pingResult = await admin.command({ ping: 1 });

      // Test read operations against the database from the connection string
      const testDb = this.client.db();
      await testDb.collection('healthcheck').findOne({}, { maxTimeMS: 5000 });

      // Check connection pool stats
      const poolStats = await this.getConnectionPoolStats();

      const issues = [];

      // Analyze pool health
      if (poolStats.availableConnections < 2) {
        issues.push('Low available connections');
      }

      if (poolStats.averageWaitTime > 1000) {
        issues.push('High average connection wait time');
      }

      if (poolStats.errorRate > 0.05) {
        issues.push('High error rate detected');
      }

      return {
        healthy: issues.length === 0,
        issues: issues,
        timestamp: new Date(),
        poolStats: poolStats
      };

    } catch (error) {
      return {
        healthy: false,
        issues: [`Health check error: ${error.message}`],
        timestamp: new Date()
      };
    }
  }

  startPerformanceMonitoring() {
    // Real-time performance monitoring
    this.monitoringInterval = setInterval(async () => {
      try {
        const stats = await this.getDetailedPerformanceStats();
        this.analyzePerformanceTrends(stats);
        this.updatePoolHealthScore(stats);

        // Log performance summary
        console.log(`Pool Performance - Health: ${this.performanceStats.poolHealthScore}%, ` +
                   `Active: ${stats.activeConnections}, ` +
                   `Avg Wait: ${stats.averageWaitTime}ms, ` +
                   `Avg Query: ${stats.averageQueryTime}ms`);

      } catch (error) {
        console.error('Performance monitoring error:', error);
      }
    }, 30000); // Every 30 seconds
  }

  async getDetailedPerformanceStats() {
    const now = Date.now();
    const fiveMinutesAgo = now - (5 * 60 * 1000);

    // Filter recent metrics
    const recentWaitTimes = this.performanceStats.connectionWaitTimes
      .filter(time => time.timestamp > fiveMinutesAgo);
    const recentQueryTimes = this.performanceStats.queryExecutionTimes
      .filter(time => time.timestamp > fiveMinutesAgo);

    const stats = {
      timestamp: now,
      totalConnections: this.performanceStats.totalConnections,
      activeConnections: this.performanceStats.activeConnections,

      // Connection timing analysis
      averageWaitTime: this.calculateAverage(recentWaitTimes.map(t => t.value)),
      maxWaitTime: recentWaitTimes.length > 0 ? Math.max(...recentWaitTimes.map(t => t.value)) : 0,
      p95WaitTime: this.calculatePercentile(recentWaitTimes.map(t => t.value), 0.95),

      // Query performance analysis
      averageQueryTime: this.calculateAverage(recentQueryTimes.map(t => t.value)),
      maxQueryTime: recentQueryTimes.length > 0 ? Math.max(...recentQueryTimes.map(t => t.value)) : 0,
      p95QueryTime: this.calculatePercentile(recentQueryTimes.map(t => t.value), 0.95),

      // Error analysis
      errorRate: this.calculateErrorRate(fiveMinutesAgo),
      connectionErrors: this.performanceStats.connectionErrors,

      // Pool utilization
      poolUtilization: (this.performanceStats.activeConnections / this.poolOptions.maxPoolSize) * 100,

      // Connection efficiency
      connectionEfficiency: this.calculateConnectionEfficiency(recentWaitTimes),

      // Server health indicators
      serverHealth: await this.getServerHealthIndicators()
    };

    return stats;
  }

  setupAutomaticOptimization() {
    // Intelligent pool optimization based on performance metrics
    this.optimizationInterval = setInterval(async () => {
      try {
        const stats = await this.getDetailedPerformanceStats();
        const optimizations = this.generateOptimizationRecommendations(stats);

        if (optimizations.length > 0) {
          console.log('Applying automatic optimizations:', optimizations);
          await this.applyOptimizations(optimizations);
        }

      } catch (error) {
        console.error('Automatic optimization error:', error);
      }
    }, 300000); // Every 5 minutes
  }

  generateOptimizationRecommendations(stats) {
    const recommendations = [];

    // High utilization optimization
    if (stats.poolUtilization > 85) {
      recommendations.push({
        type: 'increase_pool_size',
        current: this.poolOptions.maxPoolSize,
        recommended: Math.min(this.poolOptions.maxPoolSize * 1.2, 200),
        reason: 'High pool utilization detected'
      });
    }

    // High wait time optimization
    if (stats.averageWaitTime > 500) {
      recommendations.push({
        type: 'reduce_idle_timeout',
        current: this.poolOptions.maxIdleTimeMS,
        recommended: Math.max(this.poolOptions.maxIdleTimeMS * 0.8, 10000),
        reason: 'High connection wait times detected'
      });
    }

    // Low utilization optimization
    if (stats.poolUtilization < 20 && this.poolOptions.maxPoolSize > 20) {
      recommendations.push({
        type: 'decrease_pool_size',
        current: this.poolOptions.maxPoolSize,
        recommended: Math.max(this.poolOptions.maxPoolSize * 0.8, 10),
        reason: 'Low pool utilization detected'
      });
    }

    // Error rate optimization
    if (stats.errorRate > 0.05) {
      recommendations.push({
        type: 'increase_timeout',
        current: this.poolOptions.serverSelectionTimeoutMS,
        recommended: this.poolOptions.serverSelectionTimeoutMS * 1.5,
        reason: 'High error rate suggests timeout issues'
      });
    }

    return recommendations;
  }

  async applyOptimizations(optimizations) {
    for (const optimization of optimizations) {
      try {
        switch (optimization.type) {
          case 'increase_pool_size':
            // Note: Pool size changes require connection recreation
            console.log(`Recommending pool size increase from ${optimization.current} to ${optimization.recommended}`);
            break;

          case 'decrease_pool_size':
            console.log(`Recommending pool size decrease from ${optimization.current} to ${optimization.recommended}`);
            break;

          case 'reduce_idle_timeout':
            console.log(`Recommending idle timeout reduction from ${optimization.current} to ${optimization.recommended}`);
            break;

          case 'increase_timeout':
            console.log(`Recommending timeout increase from ${optimization.current} to ${optimization.recommended}`);
            break;
        }

        // Log optimization for operational tracking
        this.logOptimization(optimization);

      } catch (error) {
        console.error(`Failed to apply optimization ${optimization.type}:`, error);
      }
    }
  }

  async getConnectionPoolStats() {
    return {
      totalConnections: this.performanceStats.totalConnections,
      activeConnections: this.performanceStats.activeConnections,
      availableConnections: this.poolOptions.maxPoolSize - this.performanceStats.activeConnections,
      maxPoolSize: this.poolOptions.maxPoolSize,
      minPoolSize: this.poolOptions.minPoolSize,

      // Recent performance metrics
      averageWaitTime: this.calculateAverage(
        this.performanceStats.connectionWaitTimes
          .slice(-100)
          .map(t => t.value)
      ),

      averageQueryTime: this.calculateAverage(
        this.performanceStats.queryExecutionTimes
          .slice(-100)
          .map(t => t.value)
      ),

      errorRate: this.calculateErrorRate(Date.now() - (60 * 60 * 1000)), // Last hour
      poolHealthScore: this.performanceStats.poolHealthScore
    };
  }

  // Utility methods for calculations
  calculateAverage(values) {
    if (!values || values.length === 0) return 0;
    return values.reduce((sum, val) => sum + val, 0) / values.length;
  }

  calculatePercentile(values, percentile) {
    if (!values || values.length === 0) return 0;
    const sorted = [...values].sort((a, b) => a - b);
    const index = Math.ceil(sorted.length * percentile) - 1;
    return sorted[index] || 0;
  }

  calculateErrorRate(since) {
    const totalQueries = this.performanceStats.queryExecutionTimes
      .filter(t => t.timestamp > since).length;
    const errors = this.performanceStats.connectionErrors;
    return totalQueries > 0 ? errors / totalQueries : 0;
  }

  calculateConnectionEfficiency(waitTimes) {
    if (!waitTimes || waitTimes.length === 0) return 100;
    const fastConnections = waitTimes.filter(t => t.value < 100).length;
    return (fastConnections / waitTimes.length) * 100;
  }

  async getServerHealthIndicators() {
    try {
      const admin = this.client.db('admin');
      const serverStatus = await admin.command({ serverStatus: 1 });

      return {
        uptime: serverStatus.uptime,
        connections: serverStatus.connections,
        opcounters: serverStatus.opcounters,
        mem: serverStatus.mem,
        globalLock: serverStatus.globalLock
      };
    } catch (error) {
      console.error('Failed to get server health indicators:', error);
      return null;
    }
  }

  updatePoolHealthScore(stats) {
    let score = 100;

    // Penalize high utilization
    if (stats.poolUtilization > 90) score -= 30;
    else if (stats.poolUtilization > 75) score -= 15;

    // Penalize high wait times
    if (stats.averageWaitTime > 1000) score -= 25;
    else if (stats.averageWaitTime > 500) score -= 10;

    // Penalize errors
    if (stats.errorRate > 0.05) score -= 20;
    else if (stats.errorRate > 0.02) score -= 10;

    // Penalize low efficiency
    if (stats.connectionEfficiency < 70) score -= 15;
    else if (stats.connectionEfficiency < 85) score -= 5;

    this.performanceStats.poolHealthScore = Math.max(0, Math.min(100, score));
  }

  logPoolEvent(eventType, event) {
    this.connectionMetrics.set(`${eventType}_${Date.now()}`, {
      type: eventType,
      timestamp: new Date(),
      ...event
    });
  }

  logCommandEvent(eventType, event) {
    // Store command execution metrics
    const timestamp = new Date();

    if (eventType === 'command_succeeded' && event.executionTime !== undefined) {
      this.performanceStats.queryExecutionTimes.push({
        value: event.executionTime,
        timestamp: timestamp.getTime()
      });
    }

    // Keep only recent metrics to prevent memory growth
    if (this.performanceStats.queryExecutionTimes.length > 10000) {
      this.performanceStats.queryExecutionTimes = 
        this.performanceStats.queryExecutionTimes.slice(-5000);
    }
  }

  logServerEvent(eventType, event) {
    // Log server-level events for health monitoring
    console.log(`Server event: ${eventType}`, {
      timestamp: new Date(),
      ...event
    });
  }

  logOptimization(optimization) {
    console.log('Optimization Applied:', {
      timestamp: new Date(),
      ...optimization
    });
  }

  // Graceful shutdown
  async shutdown() {
    console.log('Shutting down MongoDB connection pool...');

    // Stop background monitoring and optimization timers before closing the client
    if (this.monitoringInterval) clearInterval(this.monitoringInterval);
    if (this.optimizationInterval) clearInterval(this.optimizationInterval);

    try {
      if (this.client) {
        await this.client.close(true); // Force close all connections
        console.log('MongoDB connection pool shut down successfully');
      }
    } catch (error) {
      console.error('Error during connection pool shutdown:', error);
    }
  }
}

// Example usage with intelligent configuration
async function createOptimizedMongoConnection() {
  const connectionManager = new MongoConnectionPoolManager(
    'mongodb://localhost:27017/production_db',
    {
      // Intelligent pool sizing based on application type
      minPoolSize: 10,           // Minimum connections for baseline performance
      maxPoolSize: 100,          // Maximum connections for peak load
      maxIdleTimeMS: 30000,      // 30 seconds idle timeout

      // Optimized timeouts for production
      waitQueueTimeoutMS: 2500,        // 2.5 seconds max wait for connection
      serverSelectionTimeoutMS: 5000,  // 5 seconds for server selection
      socketTimeoutMS: 45000,          // 45 seconds for socket operations
      connectTimeoutMS: 10000,         // 10 seconds connection timeout

      // Performance optimizations
      compressors: ['snappy', 'zlib'],
      loadBalanced: false,             // Enable only when connecting through a MongoDB-aware load balancer
      readPreference: 'secondaryPreferred',
      readConcernLevel: 'majority',

      // Write concern for consistency
      writeConcernW: 'majority',
      writeConcernJ: true,
      writeConcernTimeout: 5000
    }
  );

  try {
    const client = await connectionManager.initializeConnectionPool();

    // Return both client and manager for full control
    return {
      client,
      manager: connectionManager
    };

  } catch (error) {
    console.error('Failed to create optimized MongoDB connection:', error);
    throw error;
  }
}

// Benefits of MongoDB intelligent connection pooling:
// - Automatic connection scaling based on demand
// - Real-time performance monitoring and optimization
// - Intelligent retry logic with exponential backoff
// - Advanced health checking and diagnostic capabilities
// - Built-in connection efficiency analysis
// - Automatic pool optimization based on performance metrics
// - Comprehensive event tracking for troubleshooting
// - Native integration with MongoDB driver optimizations
// - Load balancing and failover support
// - Zero-downtime connection management

Advanced Connection Pool Optimization Techniques

Strategic connection management patterns for production-grade performance:

// Advanced MongoDB connection pooling patterns for enterprise applications
class EnterpriseConnectionManager {
  constructor() {
    this.connectionPools = new Map();
    this.routingStrategies = new Map();
    this.performanceMetrics = new Map();
    this.healthCheckers = new Map();
  }

  // Multi-tier connection pooling strategy
  async createTieredConnectionPools(configurations) {
    const poolTiers = {
      // High-priority pool for critical operations
      critical: {
        minPoolSize: 15,
        maxPoolSize: 50,
        maxIdleTimeMS: 10000,
        waitQueueTimeoutMS: 1000,
        priority: 'high'
      },

      // Standard pool for regular operations
      standard: {
        minPoolSize: 10,
        maxPoolSize: 75,
        maxIdleTimeMS: 30000,
        waitQueueTimeoutMS: 2500,
        priority: 'normal'
      },

      // Batch pool for background operations
      batch: {
        minPoolSize: 5,
        maxPoolSize: 25,
        maxIdleTimeMS: 60000,
        waitQueueTimeoutMS: 10000,
        priority: 'low'
      },

      // Analytics pool for reporting queries
      analytics: {
        minPoolSize: 3,
        maxPoolSize: 20,
        maxIdleTimeMS: 120000,
        waitQueueTimeoutMS: 30000,
        readPreference: 'secondary',
        priority: 'analytics'
      }
    };

    for (const [tierName, config] of Object.entries(poolTiers)) {
      try {
        const connectionManager = new MongoConnectionPoolManager(
          configurations[tierName]?.uri || configurations.default.uri,
          {
            ...config,
            ...configurations[tierName]
          }
        );

        const client = await connectionManager.initializeConnectionPool();

        this.connectionPools.set(tierName, {
          manager: connectionManager,
          client: client,
          config: config,
          createdAt: new Date(),
          lastHealthCheck: null
        });

        console.log(`Initialized ${tierName} connection pool with ${config.maxPoolSize} max connections`);

      } catch (error) {
        console.error(`Failed to create ${tierName} connection pool:`, error);
        throw error;
      }
    }
  }

  // Intelligent connection routing based on operation type
  getConnectionForOperation(operationType, priority = 'normal') {
    const routingRules = {
      'user_query': priority === 'high' ? 'critical' : 'standard',
      'admin_operation': 'critical',
      'bulk_insert': 'batch',
      'bulk_update': 'batch',
      'reporting_query': 'analytics',
      'aggregation': priority === 'high' ? 'standard' : 'analytics',
      'index_operation': 'batch',
      'backup_operation': 'batch',
      'monitoring_query': 'analytics'
    };

    const preferredPool = routingRules[operationType] || 'standard';
    const poolInfo = this.connectionPools.get(preferredPool);

    if (poolInfo && this.isPoolHealthy(preferredPool)) {
      return poolInfo.client;
    }

    // Fallback to standard pool if preferred pool is unavailable
    const fallbackPool = this.connectionPools.get('standard');
    if (fallbackPool && this.isPoolHealthy('standard')) {
      console.warn(`Using fallback pool for ${operationType} (preferred: ${preferredPool})`);
      return fallbackPool.client;
    }

    throw new Error('No healthy connection pools available');
  }

  // Advanced performance monitoring across all pools
  async getComprehensivePerformanceReport() {
    const report = {
      timestamp: new Date(),
      overallHealth: 'unknown',
      pools: {},
      recommendations: [],
      alerts: []
    };

    let totalHealthScore = 0;
    let poolCount = 0;

    for (const [poolName, poolInfo] of this.connectionPools.entries()) {
      try {
        const stats = await poolInfo.manager.getDetailedPerformanceStats();

        report.pools[poolName] = {
          ...stats,
          configuration: poolInfo.config,
          uptime: Date.now() - poolInfo.createdAt.getTime(),
          healthStatus: this.calculatePoolHealth(stats)
        };

        totalHealthScore += stats.poolHealthScore || 0;
        poolCount++;

        // Generate pool-specific recommendations
        const poolRecommendations = this.generatePoolRecommendations(poolName, stats);
        report.recommendations.push(...poolRecommendations);

        // Check for alerts
        const alerts = this.checkForAlerts(poolName, stats);
        report.alerts.push(...alerts);

      } catch (error) {
        report.pools[poolName] = {
          error: error.message,
          healthStatus: 'unhealthy'
        };

        report.alerts.push({
          severity: 'critical',
          pool: poolName,
          message: `Pool health check failed: ${error.message}`
        });
      }
    }

    // Calculate overall health
    report.overallHealth = poolCount > 0 ? 
      (totalHealthScore / poolCount > 80 ? 'healthy' : 
       totalHealthScore / poolCount > 60 ? 'warning' : 'critical') : 'unknown';

    return report;
  }

  isPoolHealthy(poolName) {
    const poolInfo = this.connectionPools.get(poolName);
    if (!poolInfo) return false;

    // Simple health check - can be enhanced with more sophisticated logic
    return poolInfo.manager.performanceStats.poolHealthScore > 50;
  }

  calculatePoolHealth(stats) {
    if (stats.poolHealthScore >= 80) return 'healthy';
    if (stats.poolHealthScore >= 60) return 'warning';
    return 'critical';
  }

  generatePoolRecommendations(poolName, stats) {
    const recommendations = [];

    // High utilization recommendations
    if (stats.poolUtilization > 85) {
      recommendations.push({
        pool: poolName,
        type: 'capacity',
        severity: 'high',
        message: `${poolName} pool utilization is ${stats.poolUtilization.toFixed(1)}% - consider increasing pool size`,
        suggestedAction: `Increase maxPoolSize from ${stats.maxPoolSize} to ${Math.ceil(stats.maxPoolSize * 1.3)}`
      });
    }

    // Performance recommendations
    if (stats.averageWaitTime > 1000) {
      recommendations.push({
        pool: poolName,
        type: 'performance',
        severity: 'medium',
        message: `${poolName} pool has high average wait time: ${stats.averageWaitTime.toFixed(1)}ms`,
        suggestedAction: 'Review connection timeout settings and pool sizing'
      });
    }

    // Error rate recommendations
    if (stats.errorRate > 0.05) {
      recommendations.push({
        pool: poolName,
        type: 'reliability',
        severity: 'high',
        message: `${poolName} pool has high error rate: ${(stats.errorRate * 100).toFixed(1)}%`,
        suggestedAction: 'Investigate connection failures and server health'
      });
    }

    return recommendations;
  }

  checkForAlerts(poolName, stats) {
    const alerts = [];

    // Critical utilization alert
    if (stats.poolUtilization > 95) {
      alerts.push({
        severity: 'critical',
        pool: poolName,
        type: 'utilization',
        message: `${poolName} pool utilization critical: ${stats.poolUtilization.toFixed(1)}%`,
        threshold: 95,
        currentValue: stats.poolUtilization
      });
    }

    // High error rate alert
    if (stats.errorRate > 0.1) {
      alerts.push({
        severity: 'critical',
        pool: poolName,
        type: 'error_rate',
        message: `${poolName} pool error rate critical: ${(stats.errorRate * 100).toFixed(1)}%`,
        threshold: 10,
        currentValue: stats.errorRate * 100
      });
    }

    // Connection timeout alert
    if (stats.p95WaitTime > 5000) {
      alerts.push({
        severity: 'warning',
        pool: poolName,
        type: 'latency',
        message: `${poolName} pool 95th percentile wait time high: ${stats.p95WaitTime.toFixed(1)}ms`,
        threshold: 5000,
        currentValue: stats.p95WaitTime
      });
    }

    return alerts;
  }

  // Automatic pool rebalancing based on usage patterns
  async rebalanceConnectionPools() {
    console.log('Starting automatic pool rebalancing...');

    const report = await this.getComprehensivePerformanceReport();

    for (const [poolName, poolStats] of Object.entries(report.pools)) {
      if (poolStats.error) continue;

      const rebalanceActions = this.calculateRebalanceActions(poolName, poolStats);

      for (const action of rebalanceActions) {
        await this.executeRebalanceAction(poolName, action);
      }
    }

    console.log('Pool rebalancing completed');
  }

  calculateRebalanceActions(poolName, stats) {
    const actions = [];

    // Pool size adjustments
    if (stats.poolUtilization > 80 && stats.maxPoolSize < 200) {
      actions.push({
        type: 'increase_pool_size',
        currentSize: stats.maxPoolSize,
        newSize: Math.min(Math.ceil(stats.maxPoolSize * 1.2), 200),
        reason: 'High utilization'
      });
    } else if (stats.poolUtilization < 30 && stats.maxPoolSize > 10) {
      actions.push({
        type: 'decrease_pool_size',
        currentSize: stats.maxPoolSize,
        newSize: Math.max(Math.ceil(stats.maxPoolSize * 0.8), 10),
        reason: 'Low utilization'
      });
    }

    return actions;
  }

  async executeRebalanceAction(poolName, action) {
    console.log(`Executing rebalance action for ${poolName}:`, action);

    // Note: Actual implementation would require careful coordination
    // to avoid disrupting active connections
    switch (action.type) {
      case 'increase_pool_size':
        console.log(`Would increase ${poolName} pool size from ${action.currentSize} to ${action.newSize}`);
        break;

      case 'decrease_pool_size':
        console.log(`Would decrease ${poolName} pool size from ${action.currentSize} to ${action.newSize}`);
        break;
    }
  }

  // Graceful shutdown of all connection pools
  async shutdownAllPools() {
    console.log('Shutting down all connection pools...');

    const shutdownPromises = [];

    for (const [poolName, poolInfo] of this.connectionPools.entries()) {
      shutdownPromises.push(
        poolInfo.manager.shutdown()
          .catch(error => console.error(`Error shutting down ${poolName} pool:`, error))
      );
    }

    await Promise.all(shutdownPromises);
    this.connectionPools.clear();

    console.log('All connection pools shut down successfully');
  }
}
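
A brief usage sketch for the tiered manager above; the connection URIs, database name, and collection are placeholder assumptions, not part of the original example:

// Hypothetical wiring of EnterpriseConnectionManager - URIs and names below are placeholders
async function runTieredPoolExample() {
  const enterpriseManager = new EnterpriseConnectionManager();

  // Initialize all four tiers; tiers without an entry fall back to the default URI
  await enterpriseManager.createTieredConnectionPools({
    default: { uri: 'mongodb://localhost:27017/production_db' },
    analytics: { uri: 'mongodb://localhost:27017/production_db', readPreference: 'secondary' }
  });

  // Route a high-priority user query to the critical tier
  const client = enterpriseManager.getConnectionForOperation('user_query', 'high');
  const activeUser = await client.db('production_db').collection('users').findOne({ status: 'active' });

  // Periodic reporting and rebalancing would normally run on timers
  const report = await enterpriseManager.getComprehensivePerformanceReport();
  console.log('Overall pool health:', report.overallHealth);

  await enterpriseManager.shutdownAllPools();
  return activeUser;
}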

SQL-Style Connection Pool Management with QueryLeaf

QueryLeaf provides familiar approaches to MongoDB connection pool configuration and monitoring:

-- QueryLeaf connection pool management with SQL-familiar syntax

-- Configure connection pool settings
CONFIGURE CONNECTION POOL production_pool WITH (
  min_connections = 10,
  max_connections = 100,
  connection_timeout = 10000,
  idle_timeout = 30000,
  wait_queue_timeout = 2500,

  -- Advanced settings
  retry_writes = true,
  retry_reads = true,
  compression = ['snappy', 'zlib'],
  load_balanced = true,

  -- Read preferences
  read_preference = 'secondaryPreferred',
  read_concern_level = 'majority',

  -- Write concern
  write_concern_w = 'majority',
  write_concern_j = true,
  write_concern_timeout = 5000
);

-- Monitor connection pool performance
SELECT 
  pool_name,
  active_connections,
  idle_connections,
  total_connections,

  -- Performance metrics
  avg_connection_wait_time_ms,
  max_connection_wait_time_ms,
  p95_connection_wait_time_ms,

  -- Query performance
  avg_query_execution_time_ms,
  queries_per_second,

  -- Health indicators
  pool_utilization_percent,
  error_rate_percent,
  health_score,

  -- Efficiency metrics
  connection_efficiency_percent,
  throughput_score,

  last_updated

FROM CONNECTION_POOL_STATS('production_pool')
WHERE timestamp >= NOW() - INTERVAL '1 hour';

-- Analyze connection pool trends
WITH pool_performance_trends AS (
  SELECT 
    DATE_TRUNC('minute', timestamp) as minute_bucket,

    -- Connection metrics
    AVG(active_connections) as avg_active_connections,
    MAX(active_connections) as max_active_connections,
    AVG(pool_utilization_percent) as avg_utilization,

    -- Performance metrics
    AVG(avg_connection_wait_time_ms) as avg_wait_time,
    AVG(avg_query_execution_time_ms) as avg_query_time,
    SUM(queries_per_second) as total_qps,

    -- Health metrics
    AVG(health_score) as avg_health_score,
    AVG(error_rate_percent) as avg_error_rate,
    COUNT(*) as measurement_count

  FROM CONNECTION_POOL_STATS('production_pool')
  WHERE timestamp >= NOW() - INTERVAL '24 hours'
  GROUP BY DATE_TRUNC('minute', timestamp)
),

performance_analysis AS (
  SELECT *,
    -- Trend analysis
    LAG(avg_utilization, 5) OVER (ORDER BY minute_bucket) as utilization_5min_ago,
    LAG(avg_wait_time, 10) OVER (ORDER BY minute_bucket) as wait_time_10min_ago,

    -- Performance scoring
    CASE 
      WHEN avg_health_score >= 90 THEN 'Excellent'
      WHEN avg_health_score >= 80 THEN 'Good'
      WHEN avg_health_score >= 60 THEN 'Fair'
      ELSE 'Poor'
    END as performance_grade,

    -- Utilization trends
    CASE 
      WHEN avg_utilization > LAG(avg_utilization, 5) OVER (ORDER BY minute_bucket) + 10 
        THEN 'Increasing'
      WHEN avg_utilization < LAG(avg_utilization, 5) OVER (ORDER BY minute_bucket) - 10 
        THEN 'Decreasing'
      ELSE 'Stable'
    END as utilization_trend

  FROM pool_performance_trends
)

SELECT 
  minute_bucket,
  avg_active_connections,
  max_active_connections,
  avg_utilization,
  avg_wait_time,
  avg_query_time,
  total_qps,
  performance_grade,
  utilization_trend,
  avg_health_score

FROM performance_analysis
WHERE minute_bucket >= NOW() - INTERVAL '4 hours'
ORDER BY minute_bucket DESC;

-- Connection pool optimization recommendations
WITH current_performance AS (
  SELECT 
    pool_name,
    active_connections,
    max_connections,
    pool_utilization_percent,
    avg_connection_wait_time_ms,
    error_rate_percent,
    health_score,
    queries_per_second

  FROM CONNECTION_POOL_STATS('production_pool')
  WHERE timestamp >= NOW() - INTERVAL '5 minutes'
  ORDER BY timestamp DESC
  LIMIT 1
),

optimization_analysis AS (
  SELECT *,
    -- Pool sizing recommendations
    CASE 
      WHEN pool_utilization_percent > 85 THEN 
        CONCAT('Increase max_connections from ', max_connections, ' to ', CEIL(max_connections * 1.3))
      WHEN pool_utilization_percent < 30 AND max_connections > 20 THEN 
        CONCAT('Decrease max_connections from ', max_connections, ' to ', GREATEST(CEIL(max_connections * 0.8), 20))
      ELSE 'Pool size appears optimal'
    END as pool_sizing_recommendation,

    -- Timeout recommendations
    CASE 
      WHEN avg_connection_wait_time_ms > 1000 THEN 'Consider increasing connection timeout or pool size'
      WHEN avg_connection_wait_time_ms < 50 THEN 'Connection timeouts are optimal'
      ELSE 'Connection timeouts are acceptable'
    END as timeout_recommendation,

    -- Performance recommendations
    CASE 
      WHEN error_rate_percent > 5 THEN 'Investigate connection errors - check server health and network'
      WHEN health_score < 70 THEN 'Pool performance needs attention - review metrics and configuration'
      WHEN queries_per_second > 1000 AND pool_utilization_percent > 80 THEN 'High throughput with high utilization - consider scaling'
      ELSE 'Performance appears satisfactory'
    END as performance_recommendation,

    -- Priority scoring
    CASE 
      WHEN pool_utilization_percent > 90 OR error_rate_percent > 10 THEN 'Critical'
      WHEN pool_utilization_percent > 75 OR avg_connection_wait_time_ms > 500 THEN 'High'
      WHEN health_score < 80 THEN 'Medium'
      ELSE 'Low'
    END as optimization_priority

  FROM current_performance
)

SELECT 
  pool_name,

  -- Current status
  CONCAT(active_connections, '/', max_connections) as connection_usage,
  ROUND(pool_utilization_percent, 1) as utilization_percent,
  ROUND(avg_connection_wait_time_ms, 1) as avg_wait_ms,
  ROUND(error_rate_percent, 2) as error_rate_percent,
  ROUND(health_score, 1) as health_score,

  -- Recommendations
  pool_sizing_recommendation,
  timeout_recommendation,
  performance_recommendation,
  optimization_priority,

  -- Action items
  CASE 
    WHEN optimization_priority = 'Critical' THEN 'Immediate action required'
    WHEN optimization_priority = 'High' THEN 'Schedule optimization within 24 hours'
    WHEN optimization_priority = 'Medium' THEN 'Plan optimization within 1 week'
    ELSE 'Monitor and review monthly'
  END as recommended_timeline,

  NOW() as analysis_timestamp

FROM optimization_analysis;

-- Automated pool health monitoring with alerts
CREATE ALERT CONNECTION_POOL_HEALTH_MONITOR
ON CONNECTION_POOL_STATS('production_pool')
WHEN (
  pool_utilization_percent > 90 OR
  avg_connection_wait_time_ms > 2000 OR
  error_rate_percent > 5 OR
  health_score < 70
)
NOTIFY ['dba-team@company.com', 'ops-team@company.com']
WITH MESSAGE TEMPLATE '''
Connection Pool Alert: {{ pool_name }}

Current Status:
- Utilization: {{ pool_utilization_percent }}%
- Active Connections: {{ active_connections }}/{{ max_connections }}
- Average Wait Time: {{ avg_connection_wait_time_ms }}ms
- Error Rate: {{ error_rate_percent }}%
- Health Score: {{ health_score }}

Recommended Actions:
{{ pool_sizing_recommendation }}
{{ timeout_recommendation }}
{{ performance_recommendation }}

Dashboard: https://monitoring.company.com/mongodb/pools/{{ pool_name }}
'''
EVERY 5 MINUTES;

-- Historical connection pool analysis
SELECT 
  DATE(timestamp) as analysis_date,

  -- Daily aggregates
  AVG(pool_utilization_percent) as avg_daily_utilization,
  MAX(pool_utilization_percent) as peak_daily_utilization,
  AVG(avg_connection_wait_time_ms) as avg_daily_wait_time,
  MAX(active_connections) as peak_daily_connections,

  -- Performance indicators
  AVG(health_score) as avg_daily_health_score,
  MIN(health_score) as lowest_daily_health_score,
  AVG(queries_per_second) as avg_daily_qps,
  MAX(queries_per_second) as peak_daily_qps,

  -- Issue tracking
  COUNT(CASE WHEN error_rate_percent > 1 THEN 1 END) as error_incidents,
  COUNT(CASE WHEN pool_utilization_percent > 85 THEN 1 END) as high_utilization_incidents,

  -- Efficiency metrics
  AVG(connection_efficiency_percent) as avg_connection_efficiency,
  AVG(throughput_score) as avg_throughput_score

FROM CONNECTION_POOL_STATS('production_pool')
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY DATE(timestamp)
ORDER BY analysis_date DESC;

-- QueryLeaf connection pooling provides:
-- 1. SQL-familiar pool configuration and management
-- 2. Comprehensive performance monitoring and analysis
-- 3. Intelligent optimization recommendations
-- 4. Automated health monitoring and alerting
-- 5. Historical trend analysis and capacity planning
-- 6. Integration with MongoDB's native pooling features
-- 7. Real-time performance metrics and diagnostics
-- 8. Automated scaling recommendations based on usage patterns
-- 9. Multi-tier pooling strategies for different workload types
-- 10. Enterprise-grade monitoring and operational visibility

Best Practices for MongoDB Connection Pooling

Pool Sizing Strategy

Optimal connection pool configuration for different application types (a short configuration sketch follows the list):

  1. High-Traffic Web Applications: Large pools with aggressive timeouts for rapid response
  2. Batch Processing Systems: Moderate pools with longer timeouts for sustained throughput
  3. Analytics Applications: Smaller pools with secondary read preferences for reporting queries
  4. Microservices Architecture: Multiple specialized pools for different service patterns
  5. Real-time Applications: Priority-based pooling with guaranteed connection availability
  6. Background Services: Separate pools to prevent interference with user-facing operations
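
As a rough sketch of how these profiles might map onto driver options using the MongoConnectionPoolManager shown earlier (the numbers are illustrative starting points, not benchmarks):

// Illustrative pool profiles per application type - tune the values against real workload data
const poolProfiles = {
  highTrafficWeb:  { minPoolSize: 20, maxPoolSize: 150, maxIdleTimeMS: 15000, waitQueueTimeoutMS: 1000 },
  batchProcessing: { minPoolSize: 5,  maxPoolSize: 40,  maxIdleTimeMS: 60000, waitQueueTimeoutMS: 15000 },
  analytics:       { minPoolSize: 3,  maxPoolSize: 20,  maxIdleTimeMS: 120000, readPreference: 'secondary' },
  realtime:        { minPoolSize: 25, maxPoolSize: 80,  maxIdleTimeMS: 10000, waitQueueTimeoutMS: 500 }
};

// Example: a manager for a high-traffic web service (placeholder URI)
const webPoolManager = new MongoConnectionPoolManager(
  'mongodb://localhost:27017/production_db',
  poolProfiles.highTrafficWeb
);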

Performance Monitoring Guidelines

Essential metrics for production connection pool management (a small export sketch follows the list):

  1. Utilization Metrics: Track active vs. available connections continuously
  2. Latency Monitoring: Monitor connection wait times and query execution performance
  3. Error Rate Analysis: Track connection failures and timeout patterns
  4. Resource Efficiency: Analyze connection reuse rates and pool effectiveness
  5. Capacity Planning: Use historical data to predict scaling requirements
  6. Health Scoring: Implement composite health metrics for proactive management
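
One simple way to track these metrics is to export a periodic snapshot from the pool manager's collected statistics; the sketch below assumes the MongoConnectionPoolManager above, and emitMetric is a placeholder for whatever monitoring client is in use:

// Hypothetical metrics export - emitMetric stands in for a real monitoring client
async function exportPoolMetrics(poolManager, emitMetric) {
  const stats = await poolManager.getDetailedPerformanceStats();

  emitMetric('mongodb.pool.utilization_percent', stats.poolUtilization);        // utilization
  emitMetric('mongodb.pool.wait_time_p95_ms', stats.p95WaitTime);               // latency
  emitMetric('mongodb.pool.query_time_avg_ms', stats.averageQueryTime);         // query performance
  emitMetric('mongodb.pool.error_rate', stats.errorRate);                       // error rate analysis
  emitMetric('mongodb.pool.connection_efficiency', stats.connectionEfficiency); // resource efficiency
  emitMetric('mongodb.pool.health_score', poolManager.performanceStats.poolHealthScore); // composite health
}

// Example wiring: push a snapshot every 60 seconds
// setInterval(() => exportPoolMetrics(manager, (name, value) => console.log(name, value)), 60000);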

Conclusion

MongoDB connection pooling optimization requires sophisticated strategies that balance performance, resource utilization, and operational reliability. By implementing intelligent pooling algorithms, comprehensive monitoring systems, and automated optimization techniques, applications can achieve maximum throughput while maintaining efficient resource usage and operational stability.

Key connection pooling benefits include:

  • Intelligent Scaling: Automatic pool sizing based on demand patterns and performance metrics
  • Performance Optimization: Real-time monitoring and tuning for optimal query execution
  • Resource Efficiency: Optimal connection reuse and lifecycle management
  • Operational Visibility: Comprehensive metrics and alerting for proactive management
  • High Availability: Intelligent failover and connection recovery mechanisms
  • Enterprise Integration: Support for complex deployment architectures and monitoring systems

Whether you're building high-throughput web applications, data processing pipelines, analytics platforms, or distributed microservices, MongoDB's intelligent connection pooling with QueryLeaf's familiar management interface provides the foundation for scalable, efficient database operations. This combination enables you to leverage advanced connection management capabilities while maintaining familiar database administration patterns and operational procedures.

QueryLeaf Integration: QueryLeaf automatically translates SQL-familiar connection pool configuration into optimal MongoDB driver settings while providing comprehensive monitoring and optimization through SQL-style queries. Advanced pooling strategies, performance analysis, and automated tuning are seamlessly managed through familiar database administration interfaces, making sophisticated connection management both powerful and accessible.

The integration of intelligent connection pooling with SQL-style database operations makes MongoDB an ideal platform for applications requiring both high-performance database access and familiar connection management patterns, ensuring your database connections remain both efficient and reliable as they scale to meet demanding production requirements.

MongoDB Data Modeling Best Practices and Schema Design: Advanced Document Structure Optimization and Relationship Management for Scalable Applications

Modern applications require sophisticated data modeling strategies that can handle complex relationships, evolving schemas, and high-performance requirements while maintaining data consistency and query flexibility. Traditional relational modeling approaches often struggle with document-oriented data, nested structures, and the dynamic schema requirements of modern applications, leading to complex object-relational mapping, rigid schema constraints, and performance bottlenecks that limit application scalability and development velocity.

MongoDB provides comprehensive data modeling capabilities through flexible document structures, embedded relationships, and advanced schema design patterns that enable sophisticated data organization with optimal performance characteristics. Unlike traditional databases that enforce rigid table structures and require complex joins, MongoDB integrates data modeling directly into the document structure with native support for arrays, nested objects, and flexible schemas that adapt to application requirements.
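
As a minimal illustration (the collection and field names here are generic assumptions for a blog-style application, not a prescribed schema), a post document can embed its author summary, categories, tags, and recent comments in a single structure:

// Single document covering data that the relational model below spreads across many tables
db.posts.insertOne({
  title: "Designing MongoDB Schemas",
  status: "published",
  createdAt: new Date(),
  author: {                                   // embedded author summary, denormalized from users
    userId: ObjectId("64f1c2e8a1b2c3d4e5f60718"),
    username: "jdoe",
    profilePictureUrl: "https://example.com/avatars/jdoe.png"
  },
  categories: ["databases", "mongodb"],       // arrays replace junction tables
  tags: ["schema-design", "data-modeling"],
  stats: { views: 0, likes: 0, comments: 1 },
  comments: [                                 // recent comments embedded; full history can live in its own collection
    {
      userId: ObjectId("64f1c2e8a1b2c3d4e5f60719"),
      content: "Great overview!",
      createdAt: new Date(),
      replies: []
    }
  ]
});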

The Traditional Relational Data Modeling Challenge

Conventional approaches to data modeling in relational systems face significant limitations when handling complex, hierarchical, and rapidly evolving data structures:

-- Traditional relational data modeling - rigid schema with complex relationship management

-- Basic user management with limited flexibility
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    username VARCHAR(255) UNIQUE NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,

    -- Basic profile information (limited structure)
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    date_of_birth DATE,
    phone_number VARCHAR(20),

    -- Address information (denormalized for simplicity)
    address_line_1 VARCHAR(255),
    address_line_2 VARCHAR(255),
    city VARCHAR(100),
    state VARCHAR(100),
    postal_code VARCHAR(20),
    country VARCHAR(100),

    -- Account metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_login TIMESTAMP,
    account_status VARCHAR(50) DEFAULT 'active',

    -- Basic preferences (very limited)
    preferred_language VARCHAR(10) DEFAULT 'en',
    timezone VARCHAR(50) DEFAULT 'UTC',

    -- Social media links (limited and rigid)
    facebook_url VARCHAR(255),
    twitter_url VARCHAR(255),
    linkedin_url VARCHAR(255),
    instagram_url VARCHAR(255)
);

-- Separate table for user profiles (normalized approach)
CREATE TABLE user_profiles (
    profile_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id) ON DELETE CASCADE,

    -- Extended profile information
    bio TEXT,
    website VARCHAR(255),
    company VARCHAR(255),
    job_title VARCHAR(255),

    -- Skills and interests (very basic approach)
    skills TEXT, -- Comma-separated values - not optimal
    interests TEXT, -- Comma-separated values - not optimal

    -- Professional information
    years_of_experience INTEGER,
    education_level VARCHAR(100),

    -- Contact preferences
    email_notifications BOOLEAN DEFAULT true,
    sms_notifications BOOLEAN DEFAULT false,
    marketing_emails BOOLEAN DEFAULT false,

    -- Profile metadata
    profile_completeness_percent DECIMAL(5,2) DEFAULT 0.0,
    profile_visibility VARCHAR(50) DEFAULT 'public',

    -- Profile customization (limited)
    theme VARCHAR(50) DEFAULT 'default',
    profile_picture_url VARCHAR(255),
    cover_photo_url VARCHAR(255)
);

-- User posts with basic relationship management
CREATE TABLE posts (
    post_id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id) ON DELETE CASCADE,

    -- Post content
    title VARCHAR(500) NOT NULL,
    content TEXT NOT NULL,
    post_type VARCHAR(50) DEFAULT 'article',

    -- Post metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    published_at TIMESTAMP,

    -- Post status and visibility
    status VARCHAR(50) DEFAULT 'draft',
    visibility VARCHAR(50) DEFAULT 'public',

    -- SEO and categorization
    slug VARCHAR(500) UNIQUE,
    meta_description TEXT,
    featured_image_url VARCHAR(255),

    -- Engagement metrics (basic)
    view_count INTEGER DEFAULT 0,
    like_count INTEGER DEFAULT 0,
    comment_count INTEGER DEFAULT 0,
    share_count INTEGER DEFAULT 0,

    -- Content flags
    is_featured BOOLEAN DEFAULT false,
    is_pinned BOOLEAN DEFAULT false,
    allow_comments BOOLEAN DEFAULT true
);

-- Post categories (many-to-many relationship)
CREATE TABLE categories (
    category_id SERIAL PRIMARY KEY,
    category_name VARCHAR(255) UNIQUE NOT NULL,
    category_slug VARCHAR(255) UNIQUE NOT NULL,
    description TEXT,
    parent_category_id INTEGER REFERENCES categories(category_id),

    -- Category metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    is_active BOOLEAN DEFAULT true,
    sort_order INTEGER DEFAULT 0,

    -- Category appearance
    color VARCHAR(7), -- Hex color code
    icon VARCHAR(100) -- Icon identifier
);

-- Post-category relationships (junction table)
CREATE TABLE post_categories (
    post_id INTEGER REFERENCES posts(post_id) ON DELETE CASCADE,
    category_id INTEGER REFERENCES categories(category_id) ON DELETE CASCADE,

    PRIMARY KEY (post_id, category_id),

    -- Relationship metadata
    assigned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    assigned_by INTEGER REFERENCES users(user_id)
);

-- Comments with hierarchical structure (self-referencing)
CREATE TABLE comments (
    comment_id SERIAL PRIMARY KEY,
    post_id INTEGER REFERENCES posts(post_id) ON DELETE CASCADE,
    user_id INTEGER REFERENCES users(user_id) ON DELETE CASCADE,
    parent_comment_id INTEGER REFERENCES comments(comment_id) ON DELETE CASCADE,

    -- Comment content
    content TEXT NOT NULL,
    comment_type VARCHAR(50) DEFAULT 'text',

    -- Comment metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Comment status
    status VARCHAR(50) DEFAULT 'published',
    is_edited BOOLEAN DEFAULT false,
    is_pinned BOOLEAN DEFAULT false,

    -- Engagement
    like_count INTEGER DEFAULT 0,
    reply_count INTEGER DEFAULT 0,

    -- Moderation
    is_flagged BOOLEAN DEFAULT false,
    moderation_status VARCHAR(50) DEFAULT 'approved'
);

-- Tags for flexible categorization (many-to-many)
CREATE TABLE tags (
    tag_id SERIAL PRIMARY KEY,
    tag_name VARCHAR(255) UNIQUE NOT NULL,
    tag_slug VARCHAR(255) UNIQUE NOT NULL,
    description TEXT,

    -- Tag metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    usage_count INTEGER DEFAULT 0,
    is_trending BOOLEAN DEFAULT false,

    -- Tag appearance
    color VARCHAR(7)
);

-- Post-tag relationships
CREATE TABLE post_tags (
    post_id INTEGER REFERENCES posts(post_id) ON DELETE CASCADE,
    tag_id INTEGER REFERENCES tags(tag_id) ON DELETE CASCADE,

    PRIMARY KEY (post_id, tag_id),

    -- Relationship metadata
    tagged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    tagged_by INTEGER REFERENCES users(user_id),
    relevance_score DECIMAL(3,2) DEFAULT 1.0
);

-- Complex query to retrieve post with all relationships (performance issues)
WITH post_data AS (
    SELECT 
        p.post_id,
        p.title,
        p.content,
        p.created_at,
        p.status,
        p.view_count,
        p.like_count,
        p.comment_count,

        -- User information (requires join)
        u.username,
        u.email,
        up.bio,
        up.profile_picture_url,

        -- Categories (requires aggregation)
        STRING_AGG(DISTINCT c.category_name, ', ' ORDER BY c.category_name) as categories,

        -- Tags (requires aggregation)
        STRING_AGG(DISTINCT t.tag_name, ', ' ORDER BY t.tag_name) as tags

    FROM posts p
    JOIN users u ON p.user_id = u.user_id
    LEFT JOIN user_profiles up ON u.user_id = up.user_id
    LEFT JOIN post_categories pc ON p.post_id = pc.post_id
    LEFT JOIN categories c ON pc.category_id = c.category_id
    LEFT JOIN post_tags pt ON p.post_id = pt.post_id
    LEFT JOIN tags t ON pt.tag_id = t.tag_id

    WHERE p.status = 'published'
    GROUP BY 
        p.post_id, p.title, p.content, p.created_at, p.status, 
        p.view_count, p.like_count, p.comment_count,
        u.username, u.email, up.bio, up.profile_picture_url
),

comment_hierarchy AS (
    -- Recursive CTE for nested comments (complex and performance-intensive)
    WITH RECURSIVE comment_tree AS (
        SELECT 
            c.comment_id,
            c.post_id,
            c.content,
            c.created_at,
            c.parent_comment_id,
            u.username as commenter_username,
            up.profile_picture_url as commenter_picture,
            0 as depth,
            CAST(c.comment_id as TEXT) as path
        FROM comments c
        JOIN users u ON c.user_id = u.user_id
        LEFT JOIN user_profiles up ON u.user_id = up.user_id
        WHERE c.parent_comment_id IS NULL
        AND c.status = 'published'

        UNION ALL

        SELECT 
            c.comment_id,
            c.post_id,
            c.content,
            c.created_at,
            c.parent_comment_id,
            u.username,
            up.profile_picture_url,
            ct.depth + 1,
            ct.path || '.' || c.comment_id
        FROM comments c
        JOIN users u ON c.user_id = u.user_id
        LEFT JOIN user_profiles up ON u.user_id = up.user_id
        JOIN comment_tree ct ON c.parent_comment_id = ct.comment_id
        WHERE c.status = 'published'
        AND ct.depth < 5 -- Limit recursion depth
    )
    SELECT 
        post_id,
        JSON_AGG(
            JSON_BUILD_OBJECT(
                'comment_id', comment_id,
                'content', content,
                'created_at', created_at,
                'commenter_username', commenter_username,
                'commenter_picture', commenter_picture,
                'depth', depth,
                'path', path
            ) ORDER BY path
        ) as comments_json
    FROM comment_tree
    GROUP BY post_id
)

SELECT 
    pd.post_id,
    pd.title,
    pd.content,
    pd.created_at,
    pd.username as author_username,
    pd.bio as author_bio,
    pd.profile_picture_url as author_picture,
    pd.categories,
    pd.tags,
    pd.view_count,
    pd.like_count,
    pd.comment_count,

    -- Comments as JSON (complex aggregation)
    COALESCE(ch.comments_json, '[]'::json) as comments

FROM post_data pd
LEFT JOIN comment_hierarchy ch ON pd.post_id = ch.post_id
ORDER BY pd.created_at DESC;

-- Basic user activity analysis (multiple complex joins)
WITH user_activity AS (
    SELECT 
        u.user_id,
        u.username,
        u.email,
        u.created_at as user_created_at,

        -- Post statistics
        COUNT(DISTINCT p.post_id) as total_posts,
        COUNT(DISTINCT CASE WHEN p.status = 'published' THEN p.post_id END) as published_posts,
        -- NOTE: SUM/AVG over post metrics are inflated by join fan-out from the
        -- comment/category/tag joins below - another hazard of this approach
        SUM(p.view_count) as total_views,
        SUM(p.like_count) as total_likes,

        -- Comment statistics
        COUNT(DISTINCT c.comment_id) as total_comments,

        -- Category usage
        COUNT(DISTINCT pc.category_id) as categories_used,

        -- Tag usage
        COUNT(DISTINCT pt.tag_id) as tags_used,

        -- Activity timeline
        MAX(GREATEST(p.created_at, c.created_at)) as last_activity_at,

        -- Engagement metrics
        AVG(p.view_count) as avg_views_per_post,
        AVG(p.like_count) as avg_likes_per_post,
        AVG(p.comment_count) as avg_comments_per_post

    FROM users u
    LEFT JOIN posts p ON u.user_id = p.user_id
    LEFT JOIN comments c ON u.user_id = c.user_id
    LEFT JOIN post_categories pc ON p.post_id = pc.post_id
    LEFT JOIN post_tags pt ON p.post_id = pt.post_id

    WHERE u.account_status = 'active'
    GROUP BY u.user_id, u.username, u.email, u.created_at
),

engagement_analysis AS (
    SELECT 
        ua.*,

        -- Activity classification
        CASE 
            WHEN ua.total_posts > 50 AND ua.total_comments > 100 THEN 'highly_active'
            WHEN ua.total_posts > 10 AND ua.total_comments > 25 THEN 'moderately_active'
            WHEN ua.total_posts > 0 OR ua.total_comments > 0 THEN 'low_activity'
            ELSE 'inactive'
        END as activity_level,

        -- Content quality indicators
        CASE 
            WHEN ua.avg_views_per_post > 1000 AND ua.avg_likes_per_post > 50 THEN 'high_quality'
            WHEN ua.avg_views_per_post > 500 AND ua.avg_likes_per_post > 20 THEN 'good_quality'
            WHEN ua.avg_views_per_post > 100 THEN 'average_quality'
            ELSE 'low_engagement'
        END as content_quality,

        -- User tenure
        EXTRACT(DAY FROM CURRENT_TIMESTAMP - ua.user_created_at) as days_since_signup,
        EXTRACT(DAY FROM CURRENT_TIMESTAMP - ua.last_activity_at) as days_since_last_activity,

        -- Productivity metrics
        CASE 
            WHEN EXTRACT(DAY FROM CURRENT_TIMESTAMP - ua.user_created_at) > 0 THEN
                ua.total_posts / EXTRACT(DAY FROM CURRENT_TIMESTAMP - ua.user_created_at)::DECIMAL
            ELSE 0
        END as posts_per_day,

        -- Diversity metrics
        CASE 
            WHEN ua.total_posts > 0 THEN ua.categories_used / ua.total_posts::DECIMAL
            ELSE 0
        END as category_diversity,

        CASE 
            WHEN ua.total_posts > 0 THEN ua.tags_used / ua.total_posts::DECIMAL
            ELSE 0
        END as tag_diversity

    FROM user_activity ua
)

SELECT 
    ea.username,
    ea.activity_level,
    ea.content_quality,
    ea.total_posts,
    ea.published_posts,
    ROUND(ea.total_views, 0) as total_views,
    ROUND(ea.total_likes, 0) as total_likes,
    ea.total_comments,

    -- Engagement metrics
    ROUND(ea.avg_views_per_post, 1) as avg_views_per_post,
    ROUND(ea.avg_likes_per_post, 1) as avg_likes_per_post,
    ROUND(ea.avg_comments_per_post, 1) as avg_comments_per_post,

    -- Activity metrics
    ROUND(ea.posts_per_day, 3) as posts_per_day,
    ROUND(ea.category_diversity, 2) as category_diversity,
    ROUND(ea.tag_diversity, 2) as tag_diversity,

    -- Time metrics
    ea.days_since_signup,
    ea.days_since_last_activity,

    -- Recommendations
    CASE 
        WHEN ea.activity_level = 'inactive' AND ea.days_since_signup < 30 THEN 'new_user_onboarding'
        WHEN ea.activity_level = 'low_activity' AND ea.days_since_last_activity > 30 THEN 're_engagement_campaign'
        WHEN ea.content_quality = 'high_quality' THEN 'featured_contributor'
        WHEN ea.activity_level = 'highly_active' AND ea.content_quality != 'high_quality' THEN 'content_improvement_guidance'
        ELSE 'continue_monitoring'
    END as engagement_recommendation

FROM engagement_analysis ea
ORDER BY ea.total_views DESC, ea.total_posts DESC;

-- Problems with traditional relational data modeling:
-- 1. Rigid schema requiring extensive migrations for changes
-- 2. Complex joins across multiple tables for simple data retrieval
-- 3. Object-relational impedance mismatch for nested data structures
-- 4. Performance overhead from normalization and multiple table queries
-- 5. Difficulty modeling hierarchical and semi-structured data
-- 6. Limited flexibility for evolving application requirements
-- 7. Complex relationship management requiring junction tables
-- 8. Inefficient storage for sparse or optional data fields
-- 9. Challenging aggregation across related entities
-- 10. Maintenance complexity for schema evolution and data migration
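
For contrast, the post detail page assembled above from six joined tables and a recursive comment CTE collapses to a single indexed read once author details and recent comments are embedded in the post document. A minimal sketch, assuming an open db handle and a posts collection shaped like the schema defined below:

// Single-collection equivalent of the multi-join post detail query (sketch)
const publishedPosts = await db.collection('posts')
  .find({ 'publication.status': 'published' })
  .project({
    'content.title': 1,
    'content.body': 1,
    author: 1,                 // denormalized author info - no users join needed
    'taxonomy.categories': 1,  // embedded categories - no junction table
    'taxonomy.tags': 1,
    'engagement.views.total': 1,
    'comments.recent': 1       // embedded recent comments - no recursive CTE
  })
  .sort({ 'publication.publishedAt': -1 })
  .limit(20)
  .toArray();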

MongoDB provides comprehensive data modeling capabilities with flexible document structures and embedded relationships:

// MongoDB Advanced Data Modeling - flexible document structures with optimized relationships
const { MongoClient, ObjectId } = require('mongodb');

// Comprehensive MongoDB Data Modeling Manager
class AdvancedDataModelingManager {
  constructor(mongoUri, modelingConfig = {}) {
    this.mongoUri = mongoUri;
    this.client = null;
    this.db = null;

    // Data modeling configuration
    this.config = {
      // Schema validation settings
      enableSchemaValidation: modelingConfig.enableSchemaValidation !== false,
      strictValidation: modelingConfig.strictValidation || false,
      validationLevel: modelingConfig.validationLevel || 'moderate',

      // Document design preferences
      embeddingStrategy: modelingConfig.embeddingStrategy || 'balanced', // balanced, aggressive, conservative
      referencingThreshold: modelingConfig.referencingThreshold || 100, // Size threshold for referencing
      denormalizationLevel: modelingConfig.denormalizationLevel || 'moderate',

      // Performance optimization
      enableIndexOptimization: modelingConfig.enableIndexOptimization !== false,
      enableAggregationOptimization: modelingConfig.enableAggregationOptimization || false,
      enableQueryPatternAnalysis: modelingConfig.enableQueryPatternAnalysis || false,

      // Relationship management
      cascadeDeletes: modelingConfig.cascadeDeletes || false,
      maintainReferentialIntegrity: modelingConfig.maintainReferentialIntegrity || false,
      enableRelationshipIndexing: modelingConfig.enableRelationshipIndexing !== false,

      // Schema evolution
      enableSchemaEvolution: modelingConfig.enableSchemaEvolution || false,
      backwardCompatibility: modelingConfig.backwardCompatibility !== false,
      versionedSchemas: modelingConfig.versionedSchemas || false
    };

    // Document schemas and relationships
    this.documentSchemas = new Map();
    this.relationshipMappings = new Map();
    this.validationRules = new Map();

    // Performance and optimization state
    this.queryPatterns = new Map();
    this.indexStrategies = new Map();
    this.optimizationRecommendations = [];

    this.initializeDataModeling();
  }

  async initializeDataModeling() {
    console.log('Initializing advanced MongoDB data modeling...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.mongoUri);
      await this.client.connect();
      this.db = this.client.db();

      // Setup comprehensive user schema with embedded relationships
      await this.defineUserSchema();

      // Setup post schema with flexible content structure
      await this.definePostSchema();

      // Setup optimized indexes for performance
      if (this.config.enableIndexOptimization) {
        await this.setupOptimizedIndexes();
      }

      // Initialize schema validation if enabled
      if (this.config.enableSchemaValidation) {
        await this.applySchemaValidation();
      }

      console.log('Advanced data modeling initialized successfully');

    } catch (error) {
      console.error('Error initializing data modeling:', error);
      throw error;
    }
  }

  async defineUserSchema() {
    console.log('Defining comprehensive user schema with embedded relationships...');

    try {
      const userSchema = {
        // Schema metadata
        schemaVersion: '1.0',
        schemaName: 'user_profile',
        lastUpdated: new Date(),

        // Document structure
        documentStructure: {
          // Core identification
          _id: 'ObjectId',
          userId: 'string', // Application-level ID
          username: 'string',
          email: 'string',

          // Personal information (embedded object)
          profile: {
            firstName: 'string',
            lastName: 'string',
            displayName: 'string',
            bio: 'string',
            dateOfBirth: 'date',
            phoneNumber: 'string',

            // Professional information
            company: 'string',
            jobTitle: 'string',
            yearsOfExperience: 'number',
            educationLevel: 'string',

            // Skills and interests (arrays for flexibility)
            skills: ['string'],
            interests: ['string'],
            languages: [
              {
                language: 'string',
                proficiency: 'string' // beginner, intermediate, advanced, native
              }
            ],

            // Social media links (flexible object)
            socialMedia: {
              facebook: 'string',
              twitter: 'string',
              linkedin: 'string',
              instagram: 'string',
              github: 'string',
              website: 'string'
            },

            // Profile media
            profilePicture: {
              url: 'string',
              thumbnailUrl: 'string',
              uploadedAt: 'date',
              fileSize: 'number',
              dimensions: {
                width: 'number',
                height: 'number'
              }
            },

            coverPhoto: {
              url: 'string',
              uploadedAt: 'date',
              fileSize: 'number'
            }
          },

          // Contact information (embedded for locality)
          contact: {
            addresses: [
              {
                type: 'string', // home, work, billing, shipping
                addressLine1: 'string',
                addressLine2: 'string',
                city: 'string',
                state: 'string',
                postalCode: 'string',
                country: 'string',
                isPrimary: 'boolean',
                coordinates: {
                  latitude: 'number',
                  longitude: 'number'
                }
              }
            ],

            phoneNumbers: [
              {
                type: 'string', // mobile, home, work
                number: 'string',
                countryCode: 'string',
                isPrimary: 'boolean',
                isVerified: 'boolean'
              }
            ],

            emailAddresses: [
              {
                email: 'string',
                type: 'string', // primary, work, personal
                isVerified: 'boolean',
                isPrimary: 'boolean'
              }
            ]
          },

          // Account settings and preferences (embedded)
          settings: {
            // Privacy settings
            privacy: {
              profileVisibility: 'string', // public, private, friends
              emailVisible: 'boolean',
              phoneVisible: 'boolean',
              searchable: 'boolean'
            },

            // Notification preferences
            notifications: {
              email: {
                posts: 'boolean',
                comments: 'boolean',
                mentions: 'boolean',
                messages: 'boolean',
                newsletter: 'boolean',
                marketing: 'boolean'
              },
              push: {
                posts: 'boolean',
                comments: 'boolean',
                mentions: 'boolean',
                messages: 'boolean'
              },
              sms: {
                security: 'boolean',
                important: 'boolean'
              }
            },

            // UI preferences
            interface: {
              theme: 'string', // light, dark, auto
              language: 'string',
              timezone: 'string',
              dateFormat: 'string',
              currency: 'string'
            },

            // Content preferences
            content: {
              defaultPostVisibility: 'string',
              autoSaveEnabled: 'boolean',
              contentLanguages: ['string']
            }
          },

          // Activity tracking (embedded for performance)
          activity: {
            // Account lifecycle
            createdAt: 'date',
            updatedAt: 'date',
            lastLoginAt: 'date',
            lastActiveAt: 'date',

            // Status information
            status: 'string', // active, inactive, suspended, deleted
            emailVerifiedAt: 'date',
            phoneVerifiedAt: 'date',

            // Statistics (denormalized for performance)
            stats: {
              totalPosts: 'number',
              publishedPosts: 'number',
              totalComments: 'number',
              totalLikes: 'number',
              totalViews: 'number',
              followersCount: 'number',
              followingCount: 'number',

              // Calculated metrics
              engagementRate: 'number',
              averagePostViews: 'number',
              profileCompleteness: 'number'
            },

            // Activity timeline (recent activities embedded)
            recentActivities: [
              {
                type: 'string', // login, post_created, comment_posted, profile_updated
                timestamp: 'date',
                details: 'object', // Flexible details object
                ipAddress: 'string',
                userAgent: 'string'
              }
            ]
          },

          // Authentication and security (embedded)
          authentication: {
            passwordHash: 'string',
            passwordSalt: 'string',
            lastPasswordChange: 'date',

            // Two-factor authentication
            twoFactorEnabled: 'boolean',
            twoFactorSecret: 'string',
            backupCodes: ['string'],

            // Session management
            activeSessions: [
              {
                sessionId: 'string',
                createdAt: 'date',
                lastActivityAt: 'date',
                ipAddress: 'string',
                userAgent: 'string',
                deviceInfo: 'object'
              }
            ],

            // Security events
            securityEvents: [
              {
                type: 'string', // login_attempt, password_change, suspicious_activity
                timestamp: 'date',
                details: 'object',
                resolved: 'boolean'
              }
            ]
          },

          // Content relationships (selective referencing for large collections)
          content: {
            // Recent posts (embedded for performance)
            recentPosts: [
              {
                postId: 'ObjectId',
                title: 'string',
                createdAt: 'date',
                status: 'string',
                viewCount: 'number',
                likeCount: 'number'
              }
            ],

            // Favorite posts (referenced due to potential size)
            favoritePostIds: ['ObjectId'],

            // Bookmarked content
            bookmarks: [
              {
                contentId: 'ObjectId',
                contentType: 'string', // post, comment, user
                bookmarkedAt: 'date',
                tags: ['string'],
                notes: 'string'
              }
            ]
          },

          // Social relationships (hybrid approach)
          social: {
            // Close relationships (embedded for performance)
            following: [
              {
                userId: 'ObjectId',
                username: 'string',
                followedAt: 'date',
                relationshipType: 'string' // friend, colleague, interest
              }
            ],

            // Large follower lists (referenced)
            followerIds: ['ObjectId'],

            // Social groups and communities
            groups: [
              {
                groupId: 'ObjectId',
                groupName: 'string',
                role: 'string', // member, moderator, admin
                joinedAt: 'date'
              }
            ]
          },

          // Flexible metadata for extensibility
          metadata: {
            customFields: 'object', // Application-specific fields
            tags: ['string'],
            categories: ['string'],
            source: 'string', // registration_source
            referrer: 'string'
          }
        },

        // Validation rules
        validationRules: {
          required: ['username', 'email', 'profile.firstName', 'profile.lastName'],
          unique: ['username', 'email', 'userId'],
          patterns: {
            email: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
            username: /^[a-zA-Z0-9_]{3,30}$/
          },
          ranges: {
            'profile.yearsOfExperience': { min: 0, max: 70 },
            'activity.stats.profileCompleteness': { min: 0, max: 100 }
          }
        },

        // Index strategies for optimal performance
        indexStrategies: [
          { fields: { username: 1 }, unique: true },
          { fields: { email: 1 }, unique: true },
          { fields: { userId: 1 }, unique: true },
          { fields: { 'activity.lastActiveAt': -1 } },
          { fields: { 'activity.createdAt': -1 } },
          { fields: { 'profile.skills': 1 } },
          { fields: { 'metadata.tags': 1 } },

          // Compound indexes for common query patterns
          { fields: { 'activity.status': 1, 'activity.lastActiveAt': -1 } },
          { fields: { 'profile.company': 1, 'profile.jobTitle': 1 } },
          { fields: { 'settings.privacy.profileVisibility': 1, 'activity.stats.totalPosts': -1 } }
        ]
      };

      // Store schema definition
      this.documentSchemas.set('users', userSchema);

      console.log('User schema defined with embedded relationships and flexible structure');

    } catch (error) {
      console.error('Error defining user schema:', error);
      throw error;
    }
  }

  async definePostSchema() {
    console.log('Defining flexible post schema with content optimization...');

    try {
      const postSchema = {
        // Schema metadata
        schemaVersion: '1.0',
        schemaName: 'content_post',
        lastUpdated: new Date(),

        // Document structure optimized for content management
        documentStructure: {
          // Core identification
          _id: 'ObjectId',
          postId: 'string', // Application-level ID
          slug: 'string', // URL-friendly identifier

          // Author information (denormalized for performance)
          author: {
            userId: 'ObjectId',
            username: 'string',
            displayName: 'string',
            profilePicture: 'string',

            // Author stats (denormalized)
            totalPosts: 'number',
            followerCount: 'number',
            verified: 'boolean'
          },

          // Content structure (flexible for different content types)
          content: {
            // Basic content information
            title: 'string',
            subtitle: 'string',
            excerpt: 'string',
            body: 'string', // Main content
            contentType: 'string', // article, tutorial, review, announcement

            // Rich content elements
            media: [
              {
                type: 'string', // image, video, audio, embed
                url: 'string',
                thumbnailUrl: 'string',
                caption: 'string',
                altText: 'string',
                dimensions: {
                  width: 'number',
                  height: 'number'
                },
                fileSize: 'number',
                mimeType: 'string',
                duration: 'number', // For video/audio
                uploadedAt: 'date'
              }
            ],

            // Content structure and formatting
            sections: [
              {
                type: 'string', // paragraph, heading, list, code, quote
                content: 'string',
                level: 'number', // For headings
                language: 'string', // For code blocks
                order: 'number'
              }
            ],

            // SEO and metadata
            seo: {
              metaTitle: 'string',
              metaDescription: 'string',
              keywords: ['string'],
              canonicalUrl: 'string',
              openGraphImage: 'string',

              // Schema.org structured data
              structuredData: 'object'
            },

            // Content settings
            formatting: {
              readingTime: 'number', // Estimated reading time in minutes
              wordCount: 'number',
              language: 'string',
              rtlDirection: 'boolean'
            }
          },

          // Publication and status management
          publication: {
            // Status workflow
            status: 'string', // draft, review, published, archived, deleted
            visibility: 'string', // public, private, unlisted, password_protected
            password: 'string', // For password-protected posts

            // Publishing timeline
            createdAt: 'date',
            updatedAt: 'date',
            publishedAt: 'date',
            scheduledPublishAt: 'date',

            // Revision history (embedded for recent changes)
            revisions: [
              {
                version: 'number',
                changedAt: 'date',
                changedBy: 'ObjectId',
                changeType: 'string', // content, metadata, status
                changesSummary: 'string',
                previousTitle: 'string', // Track major changes
                previousContent: 'string' // Last few versions only
              }
            ],

            // Publishing settings
            allowComments: 'boolean',
            allowSharing: 'boolean',
            allowIndexing: 'boolean',
            requireApproval: 'boolean'
          },

          // Categorization and tagging (embedded for performance)
          taxonomy: {
            // Categories (hierarchical structure)
            categories: [
              {
                categoryId: 'ObjectId',
                name: 'string',
                slug: 'string',
                level: 'number', // For hierarchical categories
                parentCategory: 'string'
              }
            ],

            // Tags (flat structure for flexibility)
            tags: [
              {
                tag: 'string',
                relevanceScore: 'number',
                addedBy: 'ObjectId',
                addedAt: 'date'
              }
            ],

            // Custom taxonomies
            customFields: {
              difficulty: 'string', // For tutorials
              estimatedTime: 'number', // For how-to content
              targetAudience: 'string',
              prerequisites: ['string']
            }
          },

          // Engagement metrics (denormalized for performance)
          engagement: {
            // View statistics
            views: {
              total: 'number',
              unique: 'number',
              today: 'number',
              thisWeek: 'number',
              thisMonth: 'number',

              // View sources
              sources: {
                direct: 'number',
                social: 'number',
                search: 'number',
                referral: 'number'
              }
            },

            // Interaction statistics
            interactions: {
              likes: 'number',
              dislikes: 'number',
              shares: 'number',
              bookmarks: 'number',

              // Comment statistics
              comments: {
                total: 'number',
                approved: 'number',
                pending: 'number',
                spam: 'number'
              }
            },

            // Engagement metrics
            metrics: {
              engagementRate: 'number',
              averageTimeOnPage: 'number',
              bounceRate: 'number',
              socialShares: 'number'
            },

            // Top comments (embedded for performance)
            topComments: [
              {
                commentId: 'ObjectId',
                content: 'string',
                author: {
                  userId: 'ObjectId',
                  username: 'string',
                  profilePicture: 'string'
                },
                createdAt: 'date',
                likeCount: 'number',
                isHighlighted: 'boolean'
              }
            ]
          },

          // Comments (hybrid approach - recent embedded, full collection referenced)
          comments: {
            // Recent comments embedded for quick access
            recent: [
              {
                commentId: 'ObjectId',
                parentCommentId: 'ObjectId', // For threading
                content: 'string',

                // Author information (denormalized)
                author: {
                  userId: 'ObjectId',
                  username: 'string',
                  displayName: 'string',
                  profilePicture: 'string'
                },

                // Comment metadata
                createdAt: 'date',
                updatedAt: 'date',
                status: 'string', // approved, pending, spam, deleted

                // Comment engagement
                likeCount: 'number',
                replyCount: 'number',
                isEdited: 'boolean',
                isPinned: 'boolean',

                // Moderation
                flags: ['string'],
                moderationStatus: 'string'
              }
            ],

            // Statistics
            statistics: {
              totalComments: 'number',
              approvedComments: 'number',
              pendingComments: 'number',
              lastCommentAt: 'date'
            }
          },

          // Performance optimization data
          performance: {
            // Caching information
            lastCached: 'date',
            cacheVersion: 'string',

            // Search optimization
            searchTerms: ['string'], // Extracted keywords for search
            searchBoost: 'number', // Manual search ranking boost

            // Content analysis
            sentiment: {
              score: 'number', // -1 to 1
              magnitude: 'number',
              language: 'string'
            },

            readabilityScore: 'number',
            complexity: 'string' // simple, moderate, complex
          },

          // Flexible metadata
          metadata: {
            customFields: 'object',
            source: 'string', // web, mobile, api
            importedFrom: 'string',
            externalIds: 'object', // For integration with other systems

            // A/B testing
            experiments: [
              {
                experimentId: 'string',
                variant: 'string',
                startDate: 'date',
                endDate: 'date'
              }
            ]
          }
        },

        // Validation rules for data integrity
        validationRules: {
          required: ['content.title', 'author.userId', 'publication.status'],
          unique: ['slug', 'postId'],
          patterns: {
            slug: /^[a-z0-9-]+$/,
            'content.contentType': /^(article|tutorial|review|announcement|news)$/
          },
          ranges: {
            'content.formatting.readingTime': { min: 0, max: 300 },
            'engagement.metrics.engagementRate': { min: 0, max: 100 }
          }
        },

        // Index strategies optimized for content queries
        indexStrategies: [
          { fields: { slug: 1 }, unique: true },
          { fields: { postId: 1 }, unique: true },
          { fields: { 'author.userId': 1, 'publication.publishedAt': -1 } },
          { fields: { 'publication.status': 1, 'publication.publishedAt': -1 } },
          { fields: { 'taxonomy.categories.name': 1 } },
          { fields: { 'taxonomy.tags.tag': 1 } },

          // Text search index
          { fields: { 'content.title': 'text', 'content.body': 'text', 'taxonomy.tags.tag': 'text' } },

          // Performance optimization indexes
          { fields: { 'engagement.views.total': -1, 'publication.publishedAt': -1 } },
          { fields: { 'publication.visibility': 1, 'engagement.views.total': -1 } },
          { fields: { 'content.contentType': 1, 'publication.publishedAt': -1 } }
        ]
      };

      // Store schema definition
      this.documentSchemas.set('posts', postSchema);

      console.log('Post schema defined with flexible content structure and performance optimization');

    } catch (error) {
      console.error('Error defining post schema:', error);
      throw error;
    }
  }

  async createOptimizedUserProfile(userData, profileData = {}) {
    console.log(`Creating optimized user profile: ${userData.username}`);

    try {
      const userDocument = {
        // Core identification
        userId: userData.userId || new ObjectId().toString(),
        username: userData.username,
        email: userData.email,

        // Personal information (embedded)
        profile: {
          firstName: profileData.firstName || '',
          lastName: profileData.lastName || '',
          displayName: profileData.displayName || `${profileData.firstName || ''} ${profileData.lastName || ''}`.trim() || userData.username,
          bio: profileData.bio || '',
          dateOfBirth: profileData.dateOfBirth ? new Date(profileData.dateOfBirth) : null,
          phoneNumber: profileData.phoneNumber || '',

          // Professional information
          company: profileData.company || '',
          jobTitle: profileData.jobTitle || '',
          yearsOfExperience: profileData.yearsOfExperience || 0,
          educationLevel: profileData.educationLevel || '',

          // Skills and interests
          skills: profileData.skills || [],
          interests: profileData.interests || [],
          languages: profileData.languages || [
            { language: 'English', proficiency: 'native' }
          ],

          // Social media links
          socialMedia: {
            facebook: profileData.socialMedia?.facebook || '',
            twitter: profileData.socialMedia?.twitter || '',
            linkedin: profileData.socialMedia?.linkedin || '',
            instagram: profileData.socialMedia?.instagram || '',
            github: profileData.socialMedia?.github || '',
            website: profileData.socialMedia?.website || ''
          },

          // Profile media
          profilePicture: profileData.profilePicture ? {
            url: profileData.profilePicture.url,
            thumbnailUrl: profileData.profilePicture.thumbnailUrl || profileData.profilePicture.url,
            uploadedAt: new Date(),
            fileSize: profileData.profilePicture.fileSize || 0,
            dimensions: profileData.profilePicture.dimensions || { width: 0, height: 0 }
          } : null
        },

        // Contact information
        contact: {
          addresses: profileData.addresses || [],
          phoneNumbers: profileData.phoneNumbers || [],
          emailAddresses: [
            {
              email: userData.email,
              type: 'primary',
              isVerified: false,
              isPrimary: true
            }
          ]
        },

        // Account settings with sensible defaults
        settings: {
          privacy: {
            profileVisibility: 'public',
            emailVisible: false,
            phoneVisible: false,
            searchable: true
          },

          notifications: {
            email: {
              posts: true,
              comments: true,
              mentions: true,
              messages: true,
              newsletter: false,
              marketing: false
            },
            push: {
              posts: true,
              comments: true,
              mentions: true,
              messages: true
            },
            sms: {
              security: true,
              important: false
            }
          },

          interface: {
            theme: 'light',
            language: 'en',
            timezone: 'UTC',
            dateFormat: 'MM/DD/YYYY',
            currency: 'USD'
          },

          content: {
            defaultPostVisibility: 'public',
            autoSaveEnabled: true,
            contentLanguages: ['en']
          }
        },

        // Activity tracking
        activity: {
          createdAt: new Date(),
          updatedAt: new Date(),
          lastLoginAt: new Date(),
          lastActiveAt: new Date(),

          status: 'active',
          emailVerifiedAt: null,
          phoneVerifiedAt: null,

          // Initialize statistics
          stats: {
            totalPosts: 0,
            publishedPosts: 0,
            totalComments: 0,
            totalLikes: 0,
            totalViews: 0,
            followersCount: 0,
            followingCount: 0,
            engagementRate: 0,
            averagePostViews: 0,
            profileCompleteness: this.calculateProfileCompleteness(profileData)
          },

          recentActivities: [
            {
              type: 'account_created',
              timestamp: new Date(),
              details: { source: 'registration' }
            }
          ]
        },

        // Authentication (placeholder - would be handled by auth system)
        authentication: {
          passwordHash: '', // Would be set by authentication system
          passwordSalt: '',
          lastPasswordChange: new Date(),
          twoFactorEnabled: false,
          activeSessions: [],
          securityEvents: []
        },

        // Initialize content relationships
        content: {
          recentPosts: [],
          favoritePostIds: [],
          bookmarks: []
        },

        // Initialize social relationships
        social: {
          following: [],
          followerIds: [],
          groups: []
        },

        // Metadata
        metadata: {
          customFields: profileData.customFields || {},
          tags: profileData.tags || [],
          categories: profileData.categories || [],
          source: profileData.source || 'direct_registration',
          referrer: profileData.referrer || ''
        }
      };

      // Insert user document
      const result = await this.db.collection('users').insertOne(userDocument);

      // Update activity statistics
      await this.updateUserStatistics(result.insertedId);

      return {
        success: true,
        userId: result.insertedId,
        userDocument: userDocument,
        profileCompleteness: userDocument.activity.stats.profileCompleteness
      };

    } catch (error) {
      console.error(`Error creating user profile for ${userData.username}:`, error);
      return {
        success: false,
        error: error.message,
        username: userData.username
      };
    }
  }

  async createOptimizedPost(postData, authorId) {
    console.log(`Creating optimized post: ${postData.title}`);

    try {
      // Get author information for denormalization
      const author = await this.db.collection('users').findOne(
        { _id: new ObjectId(authorId) },
        {
          projection: {
            username: 1,
            'profile.displayName': 1,
            'profile.profilePicture.url': 1,
            'activity.stats.totalPosts': 1,
            'activity.stats.followersCount': 1
          }
        }
      );

      if (!author) {
        throw new Error('Author not found');
      }

      const postDocument = {
        // Core identification
        postId: postData.postId || new ObjectId().toString(),
        slug: postData.slug || this.generateSlug(postData.title),

        // Author information (denormalized)
        author: {
          userId: new ObjectId(authorId),
          username: author.username,
          displayName: author.profile?.displayName || author.username,
          profilePicture: author.profile?.profilePicture?.url || '',
          totalPosts: author.activity?.stats?.totalPosts || 0,
          followerCount: author.activity?.stats?.followersCount || 0,
          verified: false // Would be determined by verification system
        },

        // Content structure
        content: {
          title: postData.title,
          subtitle: postData.subtitle || '',
          excerpt: postData.excerpt || this.generateExcerpt(postData.body),
          body: postData.body,
          contentType: postData.contentType || 'article',

          // Media content
          media: postData.media || [],

          // Content sections (for structured content)
          sections: this.parseContentSections(postData.body),

          // SEO optimization
          seo: {
            metaTitle: postData.seo?.metaTitle || postData.title,
            metaDescription: postData.seo?.metaDescription || postData.excerpt,
            keywords: postData.seo?.keywords || this.extractKeywords(postData.body),
            canonicalUrl: postData.seo?.canonicalUrl || '',
            openGraphImage: postData.featuredImage || ''
          },

          // Content formatting
          formatting: {
            readingTime: this.calculateReadingTime(postData.body),
            wordCount: this.calculateWordCount(postData.body),
            language: postData.language || 'en',
            rtlDirection: postData.rtlDirection || false
          }
        },

        // Publication settings
        publication: {
          status: postData.status || 'draft',
          visibility: postData.visibility || 'public',
          password: postData.password || '',

          createdAt: new Date(),
          updatedAt: new Date(),
          publishedAt: postData.status === 'published' ? new Date() : null,
          scheduledPublishAt: postData.scheduledPublishAt ? new Date(postData.scheduledPublishAt) : null,

          revisions: [
            {
              version: 1,
              changedAt: new Date(),
              changedBy: new ObjectId(authorId),
              changeType: 'content',
              changesSummary: 'Initial post creation'
            }
          ],

          allowComments: postData.allowComments !== false,
          allowSharing: postData.allowSharing !== false,
          allowIndexing: postData.allowIndexing !== false,
          requireApproval: postData.requireApproval || false
        },

        // Taxonomy
        taxonomy: {
          categories: (postData.categories || []).map(cat => ({
            categoryId: cat.categoryId ? new ObjectId(cat.categoryId) : new ObjectId(),
            name: cat.name || cat,
            slug: this.generateSlug(cat.name || cat),
            level: cat.level || 1,
            parentCategory: cat.parent || ''
          })),

          tags: (postData.tags || []).map(tag => ({
            tag: typeof tag === 'string' ? tag : tag.name,
            relevanceScore: typeof tag === 'object' ? tag.relevance : 1.0,
            addedBy: new ObjectId(authorId),
            addedAt: new Date()
          })),

          customFields: postData.customFields || {}
        },

        // Initialize engagement metrics
        engagement: {
          views: {
            total: 0,
            unique: 0,
            today: 0,
            thisWeek: 0,
            thisMonth: 0,
            sources: {
              direct: 0,
              social: 0,
              search: 0,
              referral: 0
            }
          },

          interactions: {
            likes: 0,
            dislikes: 0,
            shares: 0,
            bookmarks: 0,
            comments: {
              total: 0,
              approved: 0,
              pending: 0,
              spam: 0
            }
          },

          metrics: {
            engagementRate: 0,
            averageTimeOnPage: 0,
            bounceRate: 0,
            socialShares: 0
          },

          topComments: []
        },

        // Initialize comments
        comments: {
          recent: [],
          statistics: {
            totalComments: 0,
            approvedComments: 0,
            pendingComments: 0,
            lastCommentAt: null
          }
        },

        // Performance data
        performance: {
          lastCached: null,
          cacheVersion: '1.0',
          searchTerms: this.extractSearchTerms(postData.title, postData.body),
          searchBoost: postData.searchBoost || 1.0,
          sentiment: this.analyzeSentiment(postData.body),
          readabilityScore: this.calculateReadabilityScore(postData.body),
          complexity: this.assessComplexity(postData.body)
        },

        // Metadata
        metadata: {
          customFields: postData.metadata || {},
          source: postData.source || 'web',
          importedFrom: postData.importedFrom || '',
          externalIds: postData.externalIds || {},
          experiments: postData.experiments || []
        }
      };

      // Insert post document
      const result = await this.db.collection('posts').insertOne(postDocument);

      // Update author statistics
      await this.updateAuthorStatistics(authorId, 'post_created');

      // Update user's recent posts
      await this.updateUserRecentPosts(authorId, result.insertedId, postDocument);

      return {
        success: true,
        postId: result.insertedId,
        postDocument: postDocument,
        readingTime: postDocument.content.formatting.readingTime,
        wordCount: postDocument.content.formatting.wordCount
      };

    } catch (error) {
      console.error(`Error creating post '${postData.title}':`, error);
      return {
        success: false,
        error: error.message,
        title: postData.title
      };
    }
  }

  async performAdvancedQuery(queryOptions) {
    console.log('Executing advanced MongoDB query with optimized document structure...');

    try {
      const {
        collection,
        filters = {},
        projection = {},
        sort = {},
        limit = 50,
        skip = 0,
        includeRelated = false
      } = queryOptions;

      // Build aggregation pipeline for complex queries
      const pipeline = [];

      // Match stage
      if (Object.keys(filters).length > 0) {
        pipeline.push({ $match: filters });
      }

      // Add related data if requested
      if (includeRelated && collection === 'posts') {
        pipeline.push(
          // Add full comment documents for recent comments
          // (combining localField/foreignField with a nested pipeline in $lookup
          // requires MongoDB 5.0 or newer)
          {
            $lookup: {
              from: 'comments',
              localField: '_id',
              foreignField: 'postId',
              as: 'fullComments',
              pipeline: [
                { $match: { status: 'approved' } },
                { $sort: { createdAt: -1 } },
                { $limit: 10 }
              ]
            }
          },

          // Add author's full profile
          {
            $lookup: {
              from: 'users',
              localField: 'author.userId',
              foreignField: '_id',
              as: 'authorProfile',
              pipeline: [
                {
                  $project: {
                    username: 1,
                    'profile.displayName': 1,
                    'profile.bio': 1,
                    'profile.profilePicture': 1,
                    'activity.stats': 1
                  }
                }
              ]
            }
          }
        );
      }

      // Projection stage
      if (Object.keys(projection).length > 0) {
        pipeline.push({ $project: projection });
      }

      // Sort stage
      if (Object.keys(sort).length > 0) {
        pipeline.push({ $sort: sort });
      }

      // Pagination
      if (skip > 0) {
        pipeline.push({ $skip: skip });
      }

      if (limit > 0) {
        pipeline.push({ $limit: limit });
      }

      // Execute aggregation
      const results = await this.db.collection(collection).aggregate(pipeline).toArray();

      return {
        success: true,
        results: results,
        count: results.length,
        pipeline: pipeline
      };

    } catch (error) {
      console.error('Error executing advanced query:', error);
      return {
        success: false,
        error: error.message,
        queryOptions: queryOptions
      };
    }
  }

  // Utility methods for document processing and optimization

  calculateProfileCompleteness(profileData) {
    let score = 0;
    const maxScore = 100;

    // Basic information (40 points)
    if (profileData.firstName) score += 10;
    if (profileData.lastName) score += 10;
    if (profileData.bio) score += 10;
    if (profileData.profilePicture) score += 10;

    // Professional information (30 points)
    if (profileData.company) score += 10;
    if (profileData.jobTitle) score += 10;
    if (profileData.skills && profileData.skills.length > 0) score += 10;

    // Contact information (20 points)
    if (profileData.phoneNumber) score += 10;
    if (profileData.addresses && profileData.addresses.length > 0) score += 10;

    // Additional information (10 points)
    if (profileData.socialMedia && Object.values(profileData.socialMedia).some(url => url)) score += 10;

    return Math.min(score, maxScore);
  }

  generateSlug(title) {
    return title
      .toLowerCase()
      .replace(/[^a-z0-9\s-]/g, '')
      .replace(/\s+/g, '-')
      .replace(/-+/g, '-')
      .replace(/^-+|-+$/g, ''); // strip leading/trailing hyphens (trim() takes no argument)
  }

  generateExcerpt(body, maxLength = 200) {
    const text = body.replace(/<[^>]*>/g, '').trim(); // Remove HTML tags
    return text.length > maxLength ? text.substring(0, maxLength) + '...' : text;
  }

  calculateReadingTime(text) {
    const wordsPerMinute = 200;
    const wordCount = this.calculateWordCount(text);
    return Math.ceil(wordCount / wordsPerMinute);
  }

  calculateWordCount(text) {
    const cleanText = text.replace(/<[^>]*>/g, '').trim(); // Remove HTML tags
    return cleanText.split(/\s+/).filter(word => word.length > 0).length;
  }

  extractKeywords(text, maxKeywords = 10) {
    // Simple keyword extraction - in production, use NLP libraries
    const words = text.toLowerCase().match(/\b\w{4,}\b/g) || [];
    const frequency = {};

    words.forEach(word => {
      frequency[word] = (frequency[word] || 0) + 1;
    });

    return Object.entries(frequency)
      .sort(([, a], [, b]) => b - a)
      .slice(0, maxKeywords)
      .map(([word]) => word);
  }

  extractSearchTerms(title, body) {
    const titleWords = title.toLowerCase().match(/\b\w{3,}\b/g) || [];
    const bodyWords = this.extractKeywords(body, 20);
    return [...new Set([...titleWords, ...bodyWords])];
  }

  parseContentSections(body) {
    // Simple section parsing - would be more sophisticated in production
    const sections = [];
    const lines = body.split('\n');
    let order = 0;

    lines.forEach(line => {
      const trimmed = line.trim();
      if (trimmed.startsWith('#')) {
        const level = trimmed.match(/^#+/)[0].length;
        sections.push({
          type: 'heading',
          content: trimmed.replace(/^#+\s*/, ''),
          level: level,
          order: order++
        });
      } else if (trimmed.startsWith('```')) {
        sections.push({
          type: 'code',
          content: trimmed.replace(/```(\w+)?/, ''),
          language: trimmed.match(/```(\w+)/)?.[1] || 'text',
          order: order++
        });
      } else if (trimmed.length > 0) {
        sections.push({
          type: 'paragraph',
          content: trimmed,
          order: order++
        });
      }
    });

    return sections;
  }

  analyzeSentiment(text) {
    // Placeholder sentiment analysis - use proper NLP library in production
    const positiveWords = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic'];
    const negativeWords = ['bad', 'terrible', 'awful', 'horrible', 'disappointing'];

    const words = text.toLowerCase().split(/\s+/);
    let score = 0;

    words.forEach(word => {
      if (positiveWords.includes(word)) score += 0.1;
      if (negativeWords.includes(word)) score -= 0.1;
    });

    return {
      score: Math.max(-1, Math.min(1, score)),
      magnitude: Math.abs(score),
      language: 'en'
    };
  }

  calculateReadabilityScore(text) {
    // Simple readability calculation - use proper libraries in production
    const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
    const words = text.split(/\s+/);
    const avgWordsPerSentence = words.length / sentences.length;

    // Simple scoring based on average sentence length
    if (avgWordsPerSentence < 15) return 90;
    if (avgWordsPerSentence < 20) return 70;
    if (avgWordsPerSentence < 25) return 50;
    return 30;
  }

  assessComplexity(text) {
    const wordCount = this.calculateWordCount(text);
    const readabilityScore = this.calculateReadabilityScore(text);

    if (wordCount < 500 && readabilityScore > 70) return 'simple';
    if (wordCount < 2000 && readabilityScore > 50) return 'moderate';
    return 'complex';
  }

  async updateUserStatistics(userId) {
    // Update user statistics after profile changes
    await this.db.collection('users').updateOne(
      { _id: new ObjectId(userId) },
      {
        $set: {
          'activity.updatedAt': new Date()
        }
      }
    );
  }

  async updateAuthorStatistics(authorId, action) {
    const updates = {};

    if (action === 'post_created') {
      updates['$inc'] = {
        'activity.stats.totalPosts': 1
      };
    }

    updates['$set'] = {
      'activity.updatedAt': new Date(),
      'activity.lastActiveAt': new Date()
    };

    await this.db.collection('users').updateOne(
      { _id: new ObjectId(authorId) },
      updates
    );
  }

  async updateUserRecentPosts(userId, postId, postDocument) {
    await this.db.collection('users').updateOne(
      { _id: new ObjectId(userId) },
      {
        $push: {
          'content.recentPosts': {
            $each: [
              {
                postId: postId,
                title: postDocument.content.title,
                createdAt: postDocument.publication.createdAt,
                status: postDocument.publication.status,
                viewCount: 0,
                likeCount: 0
              }
            ],
            $slice: -10 // Keep only the 10 most recent posts
          }
        }
      }
    );
  }

  async setupOptimizedIndexes() {
    console.log('Setting up optimized indexes for document collections...');

    try {
      // Apply indexes from schema definitions
      for (const [collectionName, schema] of this.documentSchemas.entries()) {
        const collection = this.db.collection(collectionName);

        for (const indexStrategy of schema.indexStrategies) {
          // 'background: true' is unnecessary on MongoDB 4.2+ where index builds
          // no longer block the collection
          const indexOptions = {
            unique: indexStrategy.unique || false,
            sparse: indexStrategy.sparse || false
          };

          // Only pass a partial filter when the strategy defines one, rather than
          // sending an explicit undefined value to the server
          if (indexStrategy.partialFilterExpression) {
            indexOptions.partialFilterExpression = indexStrategy.partialFilterExpression;
          }

          await collection.createIndex(indexStrategy.fields, indexOptions);
        }
      }

      console.log('Optimized indexes created successfully');

    } catch (error) {
      console.error('Error setting up optimized indexes:', error);
      throw error;
    }
  }

  async applySchemaValidation() {
    console.log('Applying schema validation rules...');

    try {
      // Apply validation for users collection
      // Note: $jsonSchema treats dotted keys as literal field names, so nested
      // fields must be validated through nested 'properties' objects
      await this.db.createCollection('users', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['username', 'email'],
            properties: {
              username: {
                bsonType: 'string',
                pattern: '^[a-zA-Z0-9_]{3,30}$',
                description: 'Username must be 3-30 characters with only letters, numbers, and underscores'
              },
              email: {
                bsonType: 'string',
                pattern: '^[^\\s@]+@[^\\s@]+\\.[^\\s@]+$',
                description: 'Valid email address required'
              },
              profile: {
                bsonType: 'object',
                properties: {
                  yearsOfExperience: {
                    bsonType: 'int',
                    minimum: 0,
                    maximum: 70,
                    description: 'Years of experience must be between 0 and 70'
                  }
                }
              }
            }
          }
        },
        validationLevel: this.config.validationLevel,
        validationAction: this.config.strictValidation ? 'error' : 'warn'
      });

      // Apply validation for posts collection
      await this.db.createCollection('posts', {
        validator: {
          $jsonSchema: {
            bsonType: 'object',
            required: ['content', 'author'],
            properties: {
              content: {
                bsonType: 'object',
                required: ['title'],
                properties: {
                  title: {
                    bsonType: 'string',
                    minLength: 1,
                    maxLength: 500,
                    description: 'Post title is required and must be 1-500 characters'
                  },
                  contentType: {
                    bsonType: 'string',
                    enum: ['article', 'tutorial', 'review', 'announcement', 'news'],
                    description: 'Content type must be one of the predefined values'
                  }
                }
              },
              author: {
                bsonType: 'object',
                required: ['userId'],
                description: 'Denormalized author information with a required userId'
              },
              publication: {
                bsonType: 'object',
                properties: {
                  status: {
                    bsonType: 'string',
                    enum: ['draft', 'review', 'published', 'archived', 'deleted'],
                    description: 'Publication status must be one of the predefined values'
                  }
                }
              }
            }
          }
        },
        validationLevel: this.config.validationLevel,
        validationAction: this.config.strictValidation ? 'error' : 'warn'
      });

      console.log('Schema validation rules applied successfully');

    } catch (error) {
      // Collections may already exist (index setup creates them first); in that
      // case createCollection fails and the validator is not applied here -
      // use the collMod command to update validation rules on an existing collection
      if (!error.message.includes('already exists')) {
        console.error('Error applying schema validation:', error);
        throw error;
      }
    }
  }
}

// Benefits of MongoDB Advanced Data Modeling:
// - Flexible document structures that adapt to application requirements
// - Embedded relationships for optimal read performance and data locality
// - Denormalized data patterns for reduced join operations and improved query speed
// - Hierarchical data modeling with natural document nesting capabilities
// - Schema evolution support without complex migration procedures
// - Optimized indexing strategies for diverse query patterns
// - Rich data types including arrays, objects, and geospatial data
// - Query pattern optimization through strategic embedding and referencing
// - SQL-compatible operations through QueryLeaf integration
// - Production-ready data modeling patterns for scalable applications

module.exports = {
  AdvancedDataModelingManager
};
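
The manager above can be exercised end to end once its asynchronous initialization has finished. A brief usage sketch, assuming a local MongoDB instance and a hypothetical content_platform database; the short delay stands in for a proper readiness hook, since the constructor starts initializeDataModeling() without awaiting it:

// Example usage sketch for AdvancedDataModelingManager
async function runModelingDemo() {
  const manager = new AdvancedDataModelingManager('mongodb://localhost:27017/content_platform');

  // Crude wait for the constructor-triggered initialization to complete;
  // a production wrapper would expose and await the initialization promise
  await new Promise(resolve => setTimeout(resolve, 1000));

  const user = await manager.createOptimizedUserProfile(
    { username: 'jane_doe', email: 'jane@example.com' },
    { firstName: 'Jane', lastName: 'Doe', skills: ['mongodb', 'node.js'] }
  );
  if (!user.success) throw new Error(user.error);
  console.log('Profile completeness:', user.profileCompleteness);

  const post = await manager.createOptimizedPost(
    {
      title: 'Modeling Embedded Relationships in MongoDB',
      body: 'Document databases favor data locality over joins...',
      tags: ['mongodb', 'data-modeling'],
      status: 'published'
    },
    user.userId
  );
  console.log('Post created:', post.success, '| reading time (min):', post.readingTime);
}

runModelingDemo().catch(console.error);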

Understanding MongoDB Document Architecture

Advanced Schema Design and Relationship Optimization Patterns

Implement sophisticated data modeling workflows for enterprise MongoDB applications:

// Enterprise-grade data modeling with advanced relationship management capabilities
class EnterpriseDataModelingOrchestrator extends AdvancedDataModelingManager {
  constructor(mongoUri, enterpriseConfig) {
    super(mongoUri, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableAdvancedRelationships: true,
      enableDataGovernance: true,
      enablePerformanceOptimization: true,
      enableComplianceValidation: true,
      enableSchemaEvolution: true
    };

    // These setup helpers are assumed to be implemented by the application layer;
    // they are not defined on the AdvancedDataModelingManager base class above
    this.setupEnterpriseCapabilities();
    this.initializeDataGovernance();
    this.setupAdvancedRelationshipManagement();
  }

  async implementAdvancedDataStrategy() {
    console.log('Implementing enterprise data modeling strategy...');

    const dataStrategy = {
      // Multi-tier data organization
      dataTiers: {
        operationalData: {
          embedding: 'aggressive',
          caching: 'memory',
          indexing: 'comprehensive',
          validation: 'strict'
        },
        analyticalData: {
          embedding: 'conservative',
          caching: 'disk',
          indexing: 'selective',
          validation: 'moderate'
        },
        archivalData: {
          embedding: 'minimal',
          caching: 'none',
          indexing: 'basic',
          validation: 'basic'
        }
      },

      // Advanced relationship management
      relationshipManagement: {
        dynamicReferencing: true,
        cascadingOperations: true,
        relationshipIndexing: true,
        crossCollectionValidation: true
      }
    };

    return await this.deployDataStrategy(dataStrategy);
  }

  async setupAdvancedDataGovernance() {
    console.log('Setting up enterprise data governance...');

    const governanceCapabilities = {
      // Data quality management
      dataQuality: {
        validationRules: true,
        dataCleansingPipelines: true,
        qualityMonitoring: true,
        anomalyDetection: true
      },

      // Compliance and auditing
      compliance: {
        dataLineage: true,
        auditTrails: true,
        privacyControls: true,
        retentionPolicies: true
      },

      // Schema governance
      schemaGovernance: {
        versionControl: true,
        changeApproval: true,
        backwardCompatibility: true,
        migrationAutomation: true
      }
    };

    return await this.deployDataGovernance(governanceCapabilities);
  }
}
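
A minimal usage sketch for the orchestrator above (the connection string and configuration values are illustrative assumptions, and deployDataStrategy() / deployDataGovernance() are assumed to be implemented elsewhere in the manager):

// Hypothetical usage of EnterpriseDataModelingOrchestrator - values below are illustrative assumptions
const orchestrator = new EnterpriseDataModelingOrchestrator(
  'mongodb://localhost:27017/content_platform',
  { validationLevel: 'moderate', strictValidation: false }
);

async function runEnterpriseDataModeling() {
  // Deploy the tiered data strategy and governance capabilities defined above
  const strategyResult = await orchestrator.implementAdvancedDataStrategy();
  const governanceResult = await orchestrator.setupAdvancedDataGovernance();

  console.log('Data strategy deployed:', strategyResult);
  console.log('Data governance deployed:', governanceResult);
}

runEnterpriseDataModeling().catch(error => {
  console.error('Enterprise data modeling setup failed:', error);
});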

SQL-Style Data Modeling with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB data modeling and schema operations:

-- QueryLeaf advanced data modeling operations with SQL-familiar syntax for MongoDB

-- Create comprehensive user profile schema with embedded relationships
CREATE DOCUMENT_SCHEMA user_profiles AS (
  -- Core identification
  user_id VARCHAR(24) PRIMARY KEY,
  username VARCHAR(255) UNIQUE NOT NULL,
  email VARCHAR(255) UNIQUE NOT NULL,

  -- Embedded personal information
  profile OBJECT(
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    display_name VARCHAR(200),
    bio TEXT,
    date_of_birth DATE,
    phone_number VARCHAR(20),

    -- Professional information (embedded object)
    professional OBJECT(
      company VARCHAR(255),
      job_title VARCHAR(255),
      years_experience INTEGER CHECK(years_experience >= 0 AND years_experience <= 70),
      education_level VARCHAR(100),
      skills ARRAY[VARCHAR(100)],
      languages ARRAY[OBJECT(
        language VARCHAR(50),
        proficiency VARCHAR(20) CHECK(proficiency IN ('beginner', 'intermediate', 'advanced', 'native'))
      )]
    ),

    -- Social media links (embedded object)
    social_media OBJECT(
      facebook VARCHAR(255),
      twitter VARCHAR(255),
      linkedin VARCHAR(255),
      instagram VARCHAR(255),
      github VARCHAR(255),
      website VARCHAR(255)
    ),

    -- Profile media (embedded object)
    profile_picture OBJECT(
      url VARCHAR(500),
      thumbnail_url VARCHAR(500),
      uploaded_at TIMESTAMP,
      file_size INTEGER,
      dimensions OBJECT(
        width INTEGER,
        height INTEGER
      )
    )
  ),

  -- Contact information (embedded array)
  contact OBJECT(
    addresses ARRAY[OBJECT(
      type VARCHAR(20) CHECK(type IN ('home', 'work', 'billing', 'shipping')),
      address_line_1 VARCHAR(255),
      address_line_2 VARCHAR(255),
      city VARCHAR(100),
      state VARCHAR(100),
      postal_code VARCHAR(20),
      country VARCHAR(100),
      is_primary BOOLEAN DEFAULT false,
      coordinates OBJECT(
        latitude DECIMAL(10, 7),
        longitude DECIMAL(10, 7)
      )
    )],

    phone_numbers ARRAY[OBJECT(
      type VARCHAR(20) CHECK(type IN ('mobile', 'home', 'work')),
      number VARCHAR(20),
      country_code VARCHAR(5),
      is_primary BOOLEAN DEFAULT false,
      is_verified BOOLEAN DEFAULT false
    )],

    email_addresses ARRAY[OBJECT(
      email VARCHAR(255),
      type VARCHAR(20) CHECK(type IN ('primary', 'work', 'personal')),
      is_verified BOOLEAN DEFAULT false,
      is_primary BOOLEAN DEFAULT false
    )]
  ),

  -- User settings (embedded object)
  settings OBJECT(
    privacy OBJECT(
      profile_visibility VARCHAR(20) CHECK(profile_visibility IN ('public', 'private', 'friends')) DEFAULT 'public',
      email_visible BOOLEAN DEFAULT false,
      phone_visible BOOLEAN DEFAULT false,
      searchable BOOLEAN DEFAULT true
    ),

    notifications OBJECT(
      email OBJECT(
        posts BOOLEAN DEFAULT true,
        comments BOOLEAN DEFAULT true,
        mentions BOOLEAN DEFAULT true,
        messages BOOLEAN DEFAULT true,
        newsletter BOOLEAN DEFAULT false,
        marketing BOOLEAN DEFAULT false
      ),
      push OBJECT(
        posts BOOLEAN DEFAULT true,
        comments BOOLEAN DEFAULT true,
        mentions BOOLEAN DEFAULT true,
        messages BOOLEAN DEFAULT true
      ),
      sms OBJECT(
        security BOOLEAN DEFAULT true,
        important BOOLEAN DEFAULT false
      )
    ),

    interface OBJECT(
      theme VARCHAR(20) CHECK(theme IN ('light', 'dark', 'auto')) DEFAULT 'light',
      language VARCHAR(10) DEFAULT 'en',
      timezone VARCHAR(50) DEFAULT 'UTC',
      date_format VARCHAR(20) DEFAULT 'MM/DD/YYYY',
      currency VARCHAR(3) DEFAULT 'USD'
    )
  ),

  -- Activity tracking (embedded object)
  activity OBJECT(
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_login_at TIMESTAMP,
    last_active_at TIMESTAMP,

    status VARCHAR(20) CHECK(status IN ('active', 'inactive', 'suspended', 'deleted')) DEFAULT 'active',
    email_verified_at TIMESTAMP,
    phone_verified_at TIMESTAMP,

    -- Denormalized statistics for performance
    stats OBJECT(
      total_posts INTEGER DEFAULT 0,
      published_posts INTEGER DEFAULT 0,
      total_comments INTEGER DEFAULT 0,
      total_likes INTEGER DEFAULT 0,
      total_views INTEGER DEFAULT 0,
      followers_count INTEGER DEFAULT 0,
      following_count INTEGER DEFAULT 0,
      engagement_rate DECIMAL(5,2) DEFAULT 0.0,
      average_post_views DECIMAL(10,2) DEFAULT 0.0,
      profile_completeness DECIMAL(5,2) DEFAULT 0.0
    ),

    -- Recent activities (embedded array with limited size)
    recent_activities ARRAY[OBJECT(
      type VARCHAR(50),
      timestamp TIMESTAMP,
      details OBJECT,
      ip_address VARCHAR(45),
      user_agent VARCHAR(500)
    )] -- Limited to last 50 activities
  ),

  -- Content relationships (selective embedding/referencing)
  content OBJECT(
    -- Recent posts embedded for performance
    recent_posts ARRAY[OBJECT(
      post_id VARCHAR(24),
      title VARCHAR(500),
      created_at TIMESTAMP,
      status VARCHAR(20),
      view_count INTEGER,
      like_count INTEGER
    )], -- Limited to last 10 posts

    -- Large collections referenced
    favorite_post_ids ARRAY[VARCHAR(24)],

    -- Bookmarks with metadata
    bookmarks ARRAY[OBJECT(
      content_id VARCHAR(24),
      content_type VARCHAR(20) CHECK(content_type IN ('post', 'comment', 'user')),
      bookmarked_at TIMESTAMP,
      tags ARRAY[VARCHAR(50)],
      notes TEXT
    )]
  ),

  -- Social relationships (hybrid approach)
  social OBJECT(
    -- Following relationships (embedded for moderate size)
    following ARRAY[OBJECT(
      user_id VARCHAR(24),
      username VARCHAR(255),
      followed_at TIMESTAMP,
      relationship_type VARCHAR(20) CHECK(relationship_type IN ('friend', 'colleague', 'interest'))
    )],

    -- Large follower lists referenced
    follower_ids ARRAY[VARCHAR(24)],

    -- Group memberships
    groups ARRAY[OBJECT(
      group_id VARCHAR(24),
      group_name VARCHAR(255),
      role VARCHAR(20) CHECK(role IN ('member', 'moderator', 'admin')),
      joined_at TIMESTAMP
    )]
  ),

  -- Flexible metadata for extensibility
  metadata OBJECT(
    custom_fields OBJECT,
    tags ARRAY[VARCHAR(50)],
    categories ARRAY[VARCHAR(50)],
    source VARCHAR(100),
    referrer VARCHAR(255)
  ),

  -- Indexes for optimal performance
  INDEX idx_username (username),
  INDEX idx_email (email),
  INDEX idx_status_last_active (activity.status, activity.last_active_at DESC),
  INDEX idx_skills (profile.professional.skills),
  INDEX idx_location (contact.addresses.city, contact.addresses.state),

  -- Text search index
  INDEX idx_text_search ON (
    username TEXT,
    profile.display_name TEXT,
    profile.bio TEXT,
    profile.professional.skills TEXT
  ),

  -- Compound indexes for common query patterns
  INDEX idx_visibility_stats (settings.privacy.profile_visibility, activity.stats.total_posts DESC),
  INDEX idx_company_role (profile.professional.company, profile.professional.job_title)
);
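
-- A minimal usage sketch against the schema above (illustrative only; assumes the
-- QueryLeaf dot-notation access to embedded fields used throughout these examples)
SELECT
  user_id,
  username,
  profile.professional.company,
  activity.stats.followers_count
FROM user_profiles
WHERE activity.status = 'active'
  AND profile.professional.company IS NOT NULL
ORDER BY activity.stats.followers_count DESC
LIMIT 25;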

-- Advanced post schema with flexible content structure
CREATE DOCUMENT_SCHEMA content_posts AS (
  -- Core identification
  post_id VARCHAR(24) PRIMARY KEY,
  slug VARCHAR(500) UNIQUE NOT NULL,

  -- Author information (denormalized for performance)
  author OBJECT(
    user_id VARCHAR(24) NOT NULL,
    username VARCHAR(255) NOT NULL,
    display_name VARCHAR(200),
    profile_picture VARCHAR(500),
    total_posts INTEGER,
    follower_count INTEGER,
    verified BOOLEAN DEFAULT false
  ),

  -- Flexible content structure
  content OBJECT(
    title VARCHAR(500) NOT NULL,
    subtitle VARCHAR(500),
    excerpt TEXT,
    body TEXT NOT NULL,
    content_type VARCHAR(20) CHECK(content_type IN ('article', 'tutorial', 'review', 'announcement', 'news')) DEFAULT 'article',

    -- Rich media content
    media ARRAY[OBJECT(
      type VARCHAR(20) CHECK(type IN ('image', 'video', 'audio', 'embed')),
      url VARCHAR(1000),
      thumbnail_url VARCHAR(1000),
      caption TEXT,
      alt_text TEXT,
      dimensions OBJECT(
        width INTEGER,
        height INTEGER
      ),
      file_size INTEGER,
      mime_type VARCHAR(100),
      duration INTEGER, -- For video/audio
      uploaded_at TIMESTAMP
    )],

    -- Structured content sections
    sections ARRAY[OBJECT(
      type VARCHAR(20) CHECK(type IN ('paragraph', 'heading', 'list', 'code', 'quote')),
      content TEXT,
      level INTEGER, -- For headings
      language VARCHAR(20), -- For code blocks
      order_index INTEGER
    )],

    -- SEO and metadata
    seo OBJECT(
      meta_title VARCHAR(500),
      meta_description TEXT,
      keywords ARRAY[VARCHAR(100)],
      canonical_url VARCHAR(1000),
      open_graph_image VARCHAR(1000),
      structured_data OBJECT
    ),

    -- Content analysis
    formatting OBJECT(
      reading_time INTEGER, -- Minutes
      word_count INTEGER,
      language VARCHAR(10) DEFAULT 'en',
      rtl_direction BOOLEAN DEFAULT false
    )
  ),

  -- Publication management
  publication OBJECT(
    status VARCHAR(20) CHECK(status IN ('draft', 'review', 'published', 'archived', 'deleted')) DEFAULT 'draft',
    visibility VARCHAR(20) CHECK(visibility IN ('public', 'private', 'unlisted', 'password_protected')) DEFAULT 'public',
    password VARCHAR(255),

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    published_at TIMESTAMP,
    scheduled_publish_at TIMESTAMP,

    -- Revision tracking (limited to recent changes)
    revisions ARRAY[OBJECT(
      version INTEGER,
      changed_at TIMESTAMP,
      changed_by VARCHAR(24),
      change_type VARCHAR(20) CHECK(change_type IN ('content', 'metadata', 'status')),
      changes_summary TEXT,
      previous_title VARCHAR(500),
      previous_content TEXT
    )], -- Limited to last 10 revisions

    allow_comments BOOLEAN DEFAULT true,
    allow_sharing BOOLEAN DEFAULT true,
    allow_indexing BOOLEAN DEFAULT true,
    require_approval BOOLEAN DEFAULT false
  ),

  -- Categorization and tagging
  taxonomy OBJECT(
    categories ARRAY[OBJECT(
      category_id VARCHAR(24),
      name VARCHAR(255),
      slug VARCHAR(255),
      level INTEGER,
      parent_category VARCHAR(255)
    )],

    tags ARRAY[OBJECT(
      tag VARCHAR(100),
      relevance_score DECIMAL(3,2) DEFAULT 1.0,
      added_by VARCHAR(24),
      added_at TIMESTAMP
    )],

    custom_fields OBJECT(
      difficulty VARCHAR(20), -- For tutorials
      estimated_time INTEGER, -- For how-to content
      target_audience VARCHAR(100),
      prerequisites ARRAY[VARCHAR(100)]
    )
  ),

  -- Engagement metrics (denormalized for performance)
  engagement OBJECT(
    views OBJECT(
      total INTEGER DEFAULT 0,
      unique INTEGER DEFAULT 0,
      today INTEGER DEFAULT 0,
      this_week INTEGER DEFAULT 0,
      this_month INTEGER DEFAULT 0,
      sources OBJECT(
        direct INTEGER DEFAULT 0,
        social INTEGER DEFAULT 0,
        search INTEGER DEFAULT 0,
        referral INTEGER DEFAULT 0
      )
    ),

    interactions OBJECT(
      likes INTEGER DEFAULT 0,
      dislikes INTEGER DEFAULT 0,
      shares INTEGER DEFAULT 0,
      bookmarks INTEGER DEFAULT 0,
      comments OBJECT(
        total INTEGER DEFAULT 0,
        approved INTEGER DEFAULT 0,
        pending INTEGER DEFAULT 0,
        spam INTEGER DEFAULT 0
      )
    ),

    metrics OBJECT(
      engagement_rate DECIMAL(5,2) DEFAULT 0.0,
      average_time_on_page INTEGER DEFAULT 0, -- Seconds
      bounce_rate DECIMAL(5,2) DEFAULT 0.0,
      social_shares INTEGER DEFAULT 0
    ),

    -- Top comments embedded for quick access
    top_comments ARRAY[OBJECT(
      comment_id VARCHAR(24),
      content TEXT,
      author OBJECT(
        user_id VARCHAR(24),
        username VARCHAR(255),
        profile_picture VARCHAR(500)
      ),
      created_at TIMESTAMP,
      like_count INTEGER,
      is_highlighted BOOLEAN DEFAULT false
    )] -- Limited to top 5 comments
  ),

  -- Comment management (hybrid approach)
  comments OBJECT(
    -- Recent comments embedded
    recent ARRAY[OBJECT(
      comment_id VARCHAR(24),
      parent_comment_id VARCHAR(24),
      content TEXT,
      author OBJECT(
        user_id VARCHAR(24),
        username VARCHAR(255),
        display_name VARCHAR(200),
        profile_picture VARCHAR(500)
      ),
      created_at TIMESTAMP,
      updated_at TIMESTAMP,
      status VARCHAR(20) CHECK(status IN ('approved', 'pending', 'spam', 'deleted')) DEFAULT 'approved',
      like_count INTEGER DEFAULT 0,
      reply_count INTEGER DEFAULT 0,
      is_edited BOOLEAN DEFAULT false,
      is_pinned BOOLEAN DEFAULT false,
      flags ARRAY[VARCHAR(50)],
      moderation_status VARCHAR(20)
    )], -- Limited to last 20 comments

    statistics OBJECT(
      total_comments INTEGER DEFAULT 0,
      approved_comments INTEGER DEFAULT 0,
      pending_comments INTEGER DEFAULT 0,
      last_comment_at TIMESTAMP
    )
  ),

  -- Performance optimization
  performance OBJECT(
    last_cached TIMESTAMP,
    cache_version VARCHAR(10),
    search_terms ARRAY[VARCHAR(100)],
    search_boost DECIMAL(3,2) DEFAULT 1.0,

    sentiment OBJECT(
      score DECIMAL(3,2), -- -1 to 1
      magnitude DECIMAL(3,2),
      language VARCHAR(10)
    ),

    readability_score INTEGER,
    complexity VARCHAR(20) CHECK(complexity IN ('simple', 'moderate', 'complex'))
  ),

  -- Flexible metadata
  metadata OBJECT(
    custom_fields OBJECT,
    source VARCHAR(50) DEFAULT 'web',
    imported_from VARCHAR(100),
    external_ids OBJECT,

    experiments ARRAY[OBJECT(
      experiment_id VARCHAR(50),
      variant VARCHAR(50),
      start_date DATE,
      end_date DATE
    )]
  ),

  -- Optimized indexes for content queries
  INDEX idx_slug (slug),
  INDEX idx_author_published (author.user_id, publication.published_at DESC),
  INDEX idx_status_published (publication.status, publication.published_at DESC),
  INDEX idx_categories (taxonomy.categories.name),
  INDEX idx_tags (taxonomy.tags.tag),
  INDEX idx_engagement (engagement.views.total DESC, publication.published_at DESC),

  -- Text search index for content
  INDEX idx_content_search ON (
    content.title TEXT,
    content.body TEXT,
    taxonomy.tags.tag TEXT,
    taxonomy.categories.name TEXT
  ),

  -- Compound indexes for complex queries
  INDEX idx_visibility_engagement (publication.visibility, engagement.views.total DESC),
  INDEX idx_type_published (content.content_type, publication.published_at DESC),
  INDEX idx_author_stats (author.user_id, engagement.interactions.likes DESC)
);

-- Advanced data modeling analysis and optimization queries
WITH document_structure_analysis AS (
  SELECT 
    collection_name,
    COUNT(*) as total_documents,

    -- Document size analysis
    AVG(BSON_SIZE(document)) as avg_document_size_bytes,
    MAX(BSON_SIZE(document)) as max_document_size_bytes,
    MIN(BSON_SIZE(document)) as min_document_size_bytes,

    -- Embedded array analysis
    AVG(ARRAY_LENGTH(profile.professional.skills)) as avg_skills_count,
    AVG(ARRAY_LENGTH(contact.addresses)) as avg_addresses_count,
    AVG(ARRAY_LENGTH(social.following)) as avg_following_count,

    -- Nested object complexity
    AVG(OBJECT_DEPTH(profile)) as avg_profile_depth,
    AVG(OBJECT_DEPTH(settings)) as avg_settings_depth,
    AVG(OBJECT_DEPTH(activity)) as avg_activity_depth,

    -- Data completeness analysis
    COUNT(*) FILTER (WHERE profile.first_name IS NOT NULL) as profiles_with_first_name,
    COUNT(*) FILTER (WHERE profile.bio IS NOT NULL) as profiles_with_bio,
    COUNT(*) FILTER (WHERE profile.professional.company IS NOT NULL) as profiles_with_company,
    COUNT(*) FILTER (WHERE contact.addresses IS NOT NULL AND ARRAY_LENGTH(contact.addresses) > 0) as profiles_with_address,

    -- Activity patterns
    AVG(activity.stats.total_posts) as avg_posts_per_user,
    AVG(activity.stats.profile_completeness) as avg_profile_completeness,

    -- Relationship analysis
    AVG(ARRAY_LENGTH(content.favorite_post_ids)) as avg_favorites_per_user,
    AVG(ARRAY_LENGTH(social.follower_ids)) as avg_followers_per_user

  FROM USER_PROFILES
  GROUP BY collection_name
),

performance_optimization_analysis AS (
  SELECT 
    dsa.*,

    -- Document size categorization
    CASE 
      WHEN dsa.avg_document_size_bytes < 16384 THEN 'optimal_size' -- < 16KB
      WHEN dsa.avg_document_size_bytes < 65536 THEN 'good_size'     -- < 64KB
      WHEN dsa.avg_document_size_bytes < 262144 THEN 'large_size'   -- < 256KB
      ELSE 'very_large_size'                                        -- >= 256KB
    END as document_size_category,

    -- Embedding effectiveness
    CASE 
      WHEN dsa.avg_skills_count > 20 THEN 'consider_referencing_skills'
      WHEN dsa.avg_following_count > 1000 THEN 'consider_referencing_following'
      WHEN dsa.avg_addresses_count > 5 THEN 'consider_referencing_addresses'
      ELSE 'embedding_appropriate'
    END as embedding_recommendation,

    -- Data completeness scoring
    ROUND(
      (dsa.profiles_with_first_name * 100.0 / dsa.total_documents + 
       dsa.profiles_with_bio * 100.0 / dsa.total_documents + 
       dsa.profiles_with_company * 100.0 / dsa.total_documents + 
       dsa.profiles_with_address * 100.0 / dsa.total_documents) / 4, 
      2
    ) as overall_data_completeness_percent,

    -- Performance indicators
    CASE 
      WHEN dsa.avg_profile_depth > 4 THEN 'consider_flattening_structure'
      WHEN dsa.max_document_size_bytes > 1048576 THEN 'critical_size_optimization_needed' -- > 1MB
      WHEN dsa.avg_followers_per_user > 10000 THEN 'implement_follower_pagination'
      ELSE 'structure_optimized'
    END as structure_optimization_recommendation,

    -- Index strategy recommendations
    ARRAY[
      CASE WHEN dsa.profiles_with_company * 100.0 / dsa.total_documents > 60 
           THEN 'Add index on profile.professional.company' END,
      CASE WHEN dsa.avg_skills_count > 3 
           THEN 'Optimize skills array indexing' END,
      CASE WHEN dsa.profiles_with_address * 100.0 / dsa.total_documents > 70 
           THEN 'Add geospatial index for addresses' END,
      CASE WHEN dsa.avg_posts_per_user > 50 
           THEN 'Consider post relationship optimization' END
    ]::TEXT[] as indexing_recommendations

  FROM document_structure_analysis dsa
),

content_modeling_analysis AS (
  SELECT 
    'content_posts' as collection_name,
    COUNT(*) as total_posts,

    -- Content structure analysis
    AVG(BSON_SIZE(content)) as avg_content_size_bytes,
    AVG(content.formatting.word_count) as avg_word_count,
    AVG(content.formatting.reading_time) as avg_reading_time_minutes,
    AVG(ARRAY_LENGTH(content.media)) as avg_media_items,

    -- Taxonomy analysis
    AVG(ARRAY_LENGTH(taxonomy.categories)) as avg_categories_per_post,
    AVG(ARRAY_LENGTH(taxonomy.tags)) as avg_tags_per_post,

    -- Engagement patterns
    AVG(engagement.views.total) as avg_total_views,
    AVG(engagement.interactions.likes) as avg_likes,
    AVG(engagement.interactions.comments.total) as avg_comments,

    -- Comment embedding analysis
    AVG(ARRAY_LENGTH(comments.recent)) as avg_embedded_comments,
    MAX(ARRAY_LENGTH(comments.recent)) as max_embedded_comments,

    -- Content type distribution
    COUNT(*) FILTER (WHERE content.content_type = 'article') as article_count,
    COUNT(*) FILTER (WHERE content.content_type = 'tutorial') as tutorial_count,
    COUNT(*) FILTER (WHERE content.content_type = 'review') as review_count,

    -- Publication patterns
    COUNT(*) FILTER (WHERE publication.status = 'published') as published_posts,
    COUNT(*) FILTER (WHERE publication.status = 'draft') as draft_posts,

    -- Performance metrics
    AVG(performance.readability_score) as avg_readability_score,
    COUNT(*) FILTER (WHERE performance.complexity = 'simple') as simple_content,
    COUNT(*) FILTER (WHERE performance.complexity = 'moderate') as moderate_content,
    COUNT(*) FILTER (WHERE performance.complexity = 'complex') as complex_content

  FROM CONTENT_POSTS
  WHERE publication.created_at >= CURRENT_TIMESTAMP - INTERVAL '30 days'
)

SELECT 
  poa.collection_name,
  poa.total_documents,
  poa.document_size_category,

  -- Size metrics
  ROUND(poa.avg_document_size_bytes / 1024.0, 2) as avg_size_kb,
  ROUND(poa.max_document_size_bytes / 1024.0, 2) as max_size_kb,

  -- Structure analysis
  ROUND(poa.avg_profile_depth, 1) as avg_nesting_depth,
  poa.embedding_recommendation,
  poa.structure_optimization_recommendation,

  -- Data quality
  ROUND(poa.overall_data_completeness_percent, 1) as data_completeness_percent,
  ROUND(poa.avg_profile_completeness, 1) as avg_profile_completeness,

  -- Relationship metrics
  ROUND(poa.avg_skills_count, 1) as avg_skills_per_user,
  ROUND(poa.avg_following_count, 1) as avg_following_per_user,
  ROUND(poa.avg_followers_per_user, 1) as avg_followers_per_user,

  -- Performance recommendations
  ARRAY_REMOVE(poa.indexing_recommendations, NULL) as optimization_recommendations,

  -- Data modeling assessment
  CASE 
    WHEN poa.document_size_category = 'very_large_size' THEN 'critical_optimization_needed'
    WHEN poa.embedding_recommendation != 'embedding_appropriate' THEN 'relationship_optimization_needed'
    WHEN poa.overall_data_completeness_percent < 60 THEN 'data_quality_improvement_needed'
    ELSE 'data_model_optimized'
  END as overall_assessment,

  -- Specific action items
  ARRAY[
    CASE WHEN poa.avg_document_size_bytes > 262144 
         THEN 'Split large documents or reference large arrays' END,
    CASE WHEN poa.overall_data_completeness_percent < 50 
         THEN 'Implement data validation and user onboarding improvements' END,
    CASE WHEN poa.avg_followers_per_user > 5000 
         THEN 'Implement follower pagination and lazy loading' END,
    CASE WHEN poa.max_document_size_bytes > 1048576 
         THEN 'URGENT: Address oversized documents immediately' END
  ]::TEXT[] as action_items,

  -- Performance impact
  CASE 
    WHEN poa.document_size_category IN ('large_size', 'very_large_size') THEN 'high_performance_impact'
    WHEN poa.embedding_recommendation != 'embedding_appropriate' THEN 'medium_performance_impact'
    ELSE 'low_performance_impact'
  END as performance_impact

FROM performance_optimization_analysis poa

UNION ALL

-- Content analysis results
SELECT 
  cma.collection_name,
  cma.total_posts as total_documents,

  CASE 
    WHEN cma.avg_content_size_bytes < 32768 THEN 'optimal_size'
    WHEN cma.avg_content_size_bytes < 131072 THEN 'good_size' 
    WHEN cma.avg_content_size_bytes < 524288 THEN 'large_size'
    ELSE 'very_large_size'
  END as document_size_category,

  ROUND(cma.avg_content_size_bytes / 1024.0, 2) as avg_size_kb,
  0 as max_size_kb, -- Placeholder for union compatibility

  0 as avg_nesting_depth, -- Placeholder

  CASE 
    WHEN cma.avg_media_items > 10 THEN 'consider_referencing_media'
    WHEN cma.max_embedded_comments > 50 THEN 'optimize_comment_embedding'
    ELSE 'embedding_appropriate'
  END as embedding_recommendation,

  CASE 
    WHEN cma.avg_content_size_bytes > 524288 THEN 'split_large_content'
    WHEN cma.avg_embedded_comments > 25 THEN 'implement_comment_pagination'
    ELSE 'structure_optimized'
  END as structure_optimization_recommendation,

  ROUND((cma.published_posts * 100.0 / cma.total_posts), 1) as data_completeness_percent,
  ROUND(cma.avg_readability_score, 1) as avg_profile_completeness,

  ROUND(cma.avg_categories_per_post, 1) as avg_skills_per_user,
  ROUND(cma.avg_tags_per_post, 1) as avg_following_per_user,
  ROUND(cma.avg_total_views, 0) as avg_followers_per_user,

  ARRAY[
    CASE WHEN cma.avg_word_count > 3000 THEN 'Consider content length optimization' END,
    CASE WHEN cma.avg_media_items > 5 THEN 'Optimize media storage and delivery' END,
    CASE WHEN cma.complex_content > cma.total_posts * 0.3 THEN 'Improve content readability' END
  ]::TEXT[] as optimization_recommendations,

  CASE 
    WHEN cma.avg_content_size_bytes > 524288 THEN 'critical_optimization_needed'
    WHEN cma.avg_embedded_comments > 25 THEN 'relationship_optimization_needed'
    ELSE 'data_model_optimized'
  END as overall_assessment,

  ARRAY[
    CASE WHEN cma.avg_content_size_bytes > 262144 THEN 'Optimize content storage and caching' END,
    CASE WHEN cma.max_embedded_comments > 50 THEN 'Implement comment pagination' END
  ]::TEXT[] as action_items,

  CASE 
    WHEN cma.avg_content_size_bytes > 262144 THEN 'high_performance_impact'
    ELSE 'low_performance_impact'
  END as performance_impact

FROM content_modeling_analysis cma
ORDER BY performance_impact DESC, total_documents DESC;

-- QueryLeaf provides comprehensive MongoDB data modeling capabilities:
-- 1. Flexible document schema design with embedded and referenced relationships
-- 2. Advanced validation rules and constraints for data integrity
-- 3. Optimized indexing strategies for diverse query patterns
-- 4. Performance-focused embedding and referencing decisions
-- 5. Schema evolution support with backward compatibility
-- 6. Data quality analysis and optimization recommendations
-- 7. SQL-familiar syntax for complex MongoDB data operations
-- 8. Enterprise-grade data governance and compliance features
-- 9. Automated performance optimization and monitoring
-- 10. Production-ready data modeling patterns for scalable applications

Best Practices for Production Data Modeling

Document Design Strategy and Performance Optimization

Essential principles for effective MongoDB data modeling in production environments:

  1. Embedding vs. Referencing Strategy: Design optimal data relationships based on access patterns, update frequency, and document size constraints (see the sketch after this list)
  2. Schema Evolution Planning: Implement flexible schemas that can evolve with application requirements while maintaining backward compatibility
  3. Performance-First Design: Optimize document structures for common query patterns and minimize the need for complex aggregations
  4. Data Integrity Management: Establish validation rules, referential integrity patterns, and data quality monitoring procedures
  5. Indexing Strategy: Design comprehensive indexing strategies that support diverse query patterns while minimizing storage overhead
  6. Scalability Considerations: Plan for growth patterns and design document structures that scale efficiently with data volume
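
The sketch below illustrates the embedding-versus-referencing decision from point 1 using the user/follower pattern discussed earlier; the collection names, field paths, and the 1,000-follower threshold are illustrative assumptions rather than fixed recommendations:

// Illustrative embedding vs. referencing sketch (names and threshold are assumptions)
const { MongoClient } = require('mongodb');

async function saveUserWithFollowers(uri, user, followerIds) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const db = client.db('content_platform');

    if (followerIds.length <= 1000) {
      // Small, bounded relationship: embed follower ids directly for single-read access
      await db.collection('user_profiles').updateOne(
        { _id: user._id },
        { $set: { 'social.follower_ids': followerIds } },
        { upsert: true }
      );
    } else {
      // Large, unbounded relationship: reference followers in a separate collection
      // and keep only a denormalized count on the user document
      const edges = followerIds.map(followerId => ({ userId: user._id, followerId }));
      await db.collection('user_followers').insertMany(edges, { ordered: false });
      await db.collection('user_profiles').updateOne(
        { _id: user._id },
        { $set: { 'activity.stats.followers_count': followerIds.length } },
        { upsert: true }
      );
    }
  } finally {
    await client.close();
  }
}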

Enterprise Data Governance

Implement comprehensive data governance for enterprise-scale applications:

  1. Data Quality Framework: Establish automated data validation, cleansing pipelines, and quality monitoring systems
  2. Schema Governance: Implement version control, change approval processes, and automated migration procedures for schema evolution (see the sketch after this list)
  3. Compliance Integration: Ensure data modeling patterns meet regulatory requirements and industry standards
  4. Performance Monitoring: Monitor query performance, document size growth, and relationship efficiency continuously
  5. Data Lifecycle Management: Design retention policies, archival strategies, and data purging procedures
  6. Documentation Standards: Maintain comprehensive documentation for schemas, relationships, and optimization decisions
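
As a concrete starting point for schema governance (point 2), the sketch below applies a versioned JSON Schema validator with MongoDB's collMod command; the collection name, version field, and validation rules are illustrative assumptions:

// Minimal schema-governance sketch (assumed collection and rules; adjust to your environment)
const { MongoClient } = require('mongodb');

async function applyValidatorVersion(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const db = client.db('content_platform');

    // Version 2 of the user_profiles validator: new rules are additive so that
    // existing documents remain valid (backward compatibility)
    await db.command({
      collMod: 'user_profiles',
      validator: {
        $jsonSchema: {
          bsonType: 'object',
          required: ['username', 'email'],
          properties: {
            username: { bsonType: 'string', description: 'Username is required' },
            email: { bsonType: 'string', pattern: '^.+@.+$', description: 'Email must look like an address' },
            schemaVersion: { bsonType: 'int', minimum: 1, description: 'Document schema version tag' }
          }
        }
      },
      validationLevel: 'moderate', // only validate inserts and updates to already-valid documents
      validationAction: 'warn'     // log violations instead of rejecting writes during rollout
    });
  } finally {
    await client.close();
  }
}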

Conclusion

MongoDB data modeling provides comprehensive document design capabilities that enable sophisticated relationship management, flexible schema evolution, and performance-optimized data structures through embedded documents, selective referencing, and intelligent denormalization strategies. The native document model and rich data types ensure that applications can represent complex data relationships naturally while maintaining optimal query performance.

Key MongoDB Data Modeling benefits include:

  • Flexible Document Structures: Rich document model with native support for arrays, embedded objects, and hierarchical data organization
  • Optimized Relationships: Strategic embedding and referencing patterns that balance performance, consistency, and maintainability
  • Schema Evolution: Dynamic schema capabilities that adapt to changing requirements without complex migration procedures
  • Performance Optimization: Document design patterns that minimize query complexity and maximize read/write efficiency
  • Data Integrity: Comprehensive validation rules, constraints, and referential integrity patterns for production data quality
  • SQL Accessibility: Familiar SQL-style data modeling operations through QueryLeaf for accessible document design

Whether you're designing user management systems, content platforms, e-commerce applications, or analytical systems, MongoDB data modeling with QueryLeaf's familiar SQL interface provides the foundation for sophisticated, scalable document-oriented applications.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB data modeling operations while providing SQL-familiar syntax for schema design, relationship management, and validation rules. Advanced document structures, embedding strategies, and performance optimization are seamlessly handled through familiar SQL constructs, making sophisticated data modeling accessible to SQL-oriented development teams.

The combination of MongoDB's flexible document capabilities with SQL-style modeling operations makes it an ideal platform for applications requiring both complex data relationships and familiar database design patterns, ensuring your data architecture can evolve efficiently while maintaining performance and consistency as application complexity and data volume grow.

MongoDB GridFS Large File Storage and Management: Advanced Distributed File Systems and Binary Data Operations for Enterprise Applications

Modern applications require sophisticated file storage capabilities that can handle large binary files, multimedia content, and document management while providing distributed access, version control, and efficient streaming. Traditional file system approaches struggle with scalability, metadata management, and integration with database operations, leading to complex architectures with separate storage systems, synchronization challenges, and operational overhead that complicates application development and deployment.

MongoDB GridFS provides comprehensive large file storage through distributed binary data management, efficient chunk-based storage, integrated metadata handling, and streaming capabilities that enable seamless file operations within database transactions. Unlike traditional file systems that require separate storage infrastructure and complex synchronization, GridFS integrates file storage directly into MongoDB with automatic chunking, replica set distribution, and transactional consistency.
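
Before the full management framework later in this article, here is a minimal GridFS round trip using the official Node.js driver; the connection string, database, bucket name, and file paths are illustrative assumptions:

// Minimal GridFS upload/download sketch (connection string, bucket, and paths are assumptions)
const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function gridFSRoundTrip() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('media');
  const bucket = new GridFSBucket(db, { bucketName: 'documents', chunkSizeBytes: 255 * 1024 });

  // Upload: the driver splits the stream into chunks (documents.chunks)
  // and records file metadata in documents.files
  await new Promise((resolve, reject) => {
    fs.createReadStream('./report.pdf')
      .pipe(bucket.openUploadStream('report.pdf', { metadata: { category: 'reports' } }))
      .on('finish', resolve)
      .on('error', reject);
  });

  // Download by filename: chunks are streamed back in order
  await new Promise((resolve, reject) => {
    bucket.openDownloadStreamByName('report.pdf')
      .pipe(fs.createWriteStream('./report-copy.pdf'))
      .on('finish', resolve)
      .on('error', reject);
  });

  await client.close();
}

gridFSRoundTrip().catch(console.error);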

The Traditional Large File Storage Challenge

Conventional approaches to large file storage in application architectures face significant limitations:

-- Traditional file storage management - complex infrastructure with limited integration capabilities

-- Basic file metadata tracking table with minimal functionality
CREATE TABLE file_metadata (
    file_id SERIAL PRIMARY KEY,
    file_name VARCHAR(255) NOT NULL,
    file_path TEXT NOT NULL,
    file_type VARCHAR(100),
    mime_type VARCHAR(100),

    -- Basic file information (limited metadata)
    file_size_bytes BIGINT,
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    modified_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(100),

    -- Storage location tracking (manual management)
    storage_location VARCHAR(200),
    storage_server VARCHAR(100),
    storage_partition VARCHAR(50),

    -- Basic versioning (very limited)
    version_number INTEGER DEFAULT 1,
    is_current_version BOOLEAN DEFAULT true,
    parent_file_id INTEGER REFERENCES file_metadata(file_id),

    -- Access control (basic)
    access_permissions VARCHAR(50) DEFAULT 'private',
    owner_user_id INTEGER,

    -- File status
    file_status VARCHAR(50) DEFAULT 'active',
    checksum VARCHAR(64),

    -- Backup and replication tracking
    backup_status VARCHAR(50) DEFAULT 'pending',
    last_backup_time TIMESTAMP,
    replication_status VARCHAR(50) DEFAULT 'single'
);

-- File chunk storage simulation (very basic)
CREATE TABLE file_chunks (
    chunk_id SERIAL PRIMARY KEY,
    file_id INTEGER REFERENCES file_metadata(file_id),
    chunk_number INTEGER NOT NULL,
    chunk_size_bytes INTEGER NOT NULL,

    -- Chunk storage (can't actually store binary data efficiently)
    chunk_data TEXT, -- Base64 encoded - very inefficient
    chunk_checksum VARCHAR(64),

    -- Storage tracking
    storage_location VARCHAR(200),
    compression_applied BOOLEAN DEFAULT false,
    compression_ratio DECIMAL(5,2),

    UNIQUE(file_id, chunk_number)
);

-- Manual file upload processing function (very limited functionality)
CREATE OR REPLACE FUNCTION process_file_upload(
    file_name_param VARCHAR(255),
    file_path_param TEXT,
    file_size_param BIGINT,
    chunk_size_param INTEGER DEFAULT 1048576 -- 1MB chunks
) RETURNS TABLE (
    upload_success BOOLEAN,
    file_id INTEGER,
    total_chunks INTEGER,
    processing_time_seconds INTEGER,
    error_message TEXT
) AS $$
DECLARE
    new_file_id INTEGER;
    total_chunks_count INTEGER;
    chunk_counter INTEGER := 1;
    processing_start TIMESTAMP;
    processing_end TIMESTAMP;
    upload_error TEXT := '';
    upload_result BOOLEAN := true;
    simulated_chunk_data TEXT;
BEGIN
    processing_start := clock_timestamp();

    BEGIN
        -- Calculate total chunks needed
        total_chunks_count := CEILING(file_size_param::DECIMAL / chunk_size_param);

        -- Create file metadata record
        INSERT INTO file_metadata (
            file_name, file_path, file_size_bytes, 
            storage_location, checksum
        )
        VALUES (
            file_name_param, file_path_param, file_size_param,
            '/storage/files/' || EXTRACT(YEAR FROM CURRENT_DATE) || '/' || 
            EXTRACT(MONTH FROM CURRENT_DATE) || '/',
            MD5(file_name_param || file_size_param::TEXT) -- Basic checksum
        )
        RETURNING file_metadata.file_id INTO new_file_id;

        -- Simulate chunk processing (very basic)
        WHILE chunk_counter <= total_chunks_count LOOP
            -- Calculate chunk size for this chunk
            DECLARE
                current_chunk_size INTEGER;
            BEGIN
                IF chunk_counter = total_chunks_count THEN
                    current_chunk_size := file_size_param - ((chunk_counter - 1) * chunk_size_param);
                ELSE
                    current_chunk_size := chunk_size_param;
                END IF;

                -- Simulate chunk data (can't actually handle binary data efficiently)
                simulated_chunk_data := 'chunk_' || chunk_counter || '_data_placeholder';

                -- Insert chunk record
                INSERT INTO file_chunks (
                    file_id, chunk_number, chunk_size_bytes, 
                    chunk_data, chunk_checksum, storage_location
                )
                VALUES (
                    new_file_id, chunk_counter, current_chunk_size,
                    simulated_chunk_data,
                    MD5(simulated_chunk_data),
                    '/storage/chunks/' || new_file_id || '/' || chunk_counter
                );

                chunk_counter := chunk_counter + 1;

                -- Simulate processing time
                PERFORM pg_sleep(0.01);
            END;
        END LOOP;

        -- Update file status
        UPDATE file_metadata 
        SET file_status = 'available',
            modified_date = clock_timestamp()
        WHERE file_id = new_file_id;

    EXCEPTION WHEN OTHERS THEN
        upload_result := false;
        upload_error := SQLERRM;

        -- Cleanup on failure
        DELETE FROM file_chunks WHERE file_id = new_file_id;
        DELETE FROM file_metadata WHERE file_id = new_file_id;
    END;

    processing_end := clock_timestamp();

    RETURN QUERY SELECT 
        upload_result,
        new_file_id,
        total_chunks_count,
        EXTRACT(EPOCH FROM processing_end - processing_start)::INTEGER,
        CASE WHEN NOT upload_result THEN upload_error ELSE NULL END;

END;
$$ LANGUAGE plpgsql;

-- Basic file download function (very limited streaming capabilities)
CREATE OR REPLACE FUNCTION download_file_chunks(file_id_param INTEGER)
RETURNS TABLE (
    chunk_number INTEGER,
    chunk_size_bytes INTEGER,
    chunk_data TEXT,
    download_order INTEGER
) AS $$
BEGIN
    -- Simple chunk retrieval (no streaming, no optimization)
    RETURN QUERY
    SELECT 
        fc.chunk_number,
        fc.chunk_size_bytes,
        fc.chunk_data,
        fc.chunk_number as download_order
    FROM file_chunks fc
    WHERE fc.file_id = file_id_param
    ORDER BY fc.chunk_number;

    -- Update download statistics (basic tracking)
    UPDATE file_metadata 
    SET modified_date = CURRENT_TIMESTAMP
    WHERE file_id = file_id_param;
END;
$$ LANGUAGE plpgsql;

-- Execute file upload simulation
SELECT * FROM process_file_upload('large_document.pdf', '/uploads/large_document.pdf', 50000000, 1048576);

-- Basic file management and cleanup
WITH file_storage_analysis AS (
    SELECT 
        fm.file_id,
        fm.file_name,
        fm.file_size_bytes,
        fm.created_date,
        fm.file_status,
        COUNT(fc.chunk_id) as total_chunks,
        SUM(fc.chunk_size_bytes) as total_chunk_size,

        -- Storage efficiency calculation (basic)
        CASE 
            WHEN fm.file_size_bytes > 0 THEN
                (SUM(fc.chunk_size_bytes)::DECIMAL / fm.file_size_bytes) * 100
            ELSE 0
        END as storage_efficiency_percent,

        -- Age analysis
        EXTRACT(DAY FROM CURRENT_TIMESTAMP - fm.created_date) as file_age_days,

        -- Basic categorization
        CASE 
            WHEN fm.file_size_bytes > 100 * 1024 * 1024 THEN 'large'
            WHEN fm.file_size_bytes > 10 * 1024 * 1024 THEN 'medium'
            ELSE 'small'
        END as file_size_category

    FROM file_metadata fm
    LEFT JOIN file_chunks fc ON fm.file_id = fc.file_id
    WHERE fm.created_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY fm.file_id, fm.file_name, fm.file_size_bytes, fm.created_date, fm.file_status
)
SELECT 
    fsa.file_name,
    fsa.file_size_category,
    ROUND(fsa.file_size_bytes / 1024.0 / 1024.0, 2) as file_size_mb,
    fsa.total_chunks,
    ROUND(fsa.storage_efficiency_percent, 1) as storage_efficiency_percent,
    fsa.file_age_days,
    fsa.file_status,

    -- Storage recommendations (very basic)
    CASE 
        WHEN fsa.storage_efficiency_percent < 95 THEN 'check_chunk_integrity'
        WHEN fsa.file_age_days > 365 AND fsa.file_status = 'active' THEN 'consider_archiving'
        WHEN fsa.total_chunks = 0 THEN 'missing_chunks'
        ELSE 'normal'
    END as recommendation

FROM file_storage_analysis fsa
ORDER BY fsa.file_size_bytes DESC, fsa.created_date DESC;

-- Basic file cleanup (manual process)
WITH old_files AS (
    SELECT file_id, file_name, file_size_bytes
    FROM file_metadata
    WHERE created_date < CURRENT_DATE - INTERVAL '2 years'
    AND file_status = 'archived'
),
cleanup_chunks AS (
    DELETE FROM file_chunks
    WHERE file_id IN (SELECT file_id FROM old_files)
    RETURNING file_id, chunk_size_bytes
),
cleanup_files AS (
    DELETE FROM file_metadata
    WHERE file_id IN (SELECT file_id FROM old_files)
    RETURNING file_id, file_size_bytes
)
SELECT 
    COUNT(DISTINCT cf.file_id) as files_cleaned,
    SUM(cf.file_size_bytes) as total_space_freed_bytes,
    ROUND(SUM(cf.file_size_bytes) / 1024.0 / 1024.0 / 1024.0, 2) as space_freed_gb,
    COUNT(cc.file_id) as chunks_cleaned
FROM cleanup_files cf
LEFT JOIN cleanup_chunks cc ON cf.file_id = cc.file_id;

-- Problems with traditional file storage approaches:
-- 1. Inefficient binary data handling in relational databases
-- 2. Manual chunk management with no automatic optimization
-- 3. Limited streaming capabilities and poor performance for large files
-- 4. No built-in replication or distributed storage features
-- 5. Basic metadata management with limited search capabilities
-- 6. Complex backup and recovery procedures for file data
-- 7. No transactional consistency between file operations and database operations
-- 8. Limited scalability for high-volume file storage requirements
-- 9. No built-in compression or space optimization features
-- 10. Manual versioning and access control management

MongoDB GridFS provides comprehensive large file storage with advanced binary data management:

// MongoDB GridFS Advanced File Storage - comprehensive binary data management with streaming capabilities
const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');
const { createReadStream, createWriteStream } = require('fs');
const { pipeline } = require('stream');
const { promisify } = require('util');
const crypto = require('crypto');
const path = require('path');
const { EventEmitter } = require('events');

// Comprehensive MongoDB GridFS File Manager
class AdvancedGridFSFileManager extends EventEmitter {
  constructor(connectionString, gridFSConfig = {}) {
    super();
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;
    this.gridFSBuckets = new Map();

    // Advanced GridFS configuration
    this.config = {
      // Bucket configuration
      defaultBucket: gridFSConfig.defaultBucket || 'fs',
      customBuckets: gridFSConfig.customBuckets || {},
      chunkSizeBytes: gridFSConfig.chunkSizeBytes || 261120, // 255KB default

      // File management settings
      enableMetadataIndexing: gridFSConfig.enableMetadataIndexing !== false,
      enableVersionControl: gridFSConfig.enableVersionControl || false,
      enableCompression: gridFSConfig.enableCompression || false,
      enableEncryption: gridFSConfig.enableEncryption || false,

      // Storage optimization
      enableAutomaticCleanup: gridFSConfig.enableAutomaticCleanup || false,
      enableDeduplication: gridFSConfig.enableDeduplication || false,
      enableThumbnailGeneration: gridFSConfig.enableThumbnailGeneration || false,

      // Performance configuration
      enableParallelUploads: gridFSConfig.enableParallelUploads || false,
      maxConcurrentUploads: gridFSConfig.maxConcurrentUploads || 5,
      enableStreamingOptimization: gridFSConfig.enableStreamingOptimization || false,

      // Access control and security
      enableAccessControl: gridFSConfig.enableAccessControl || false,
      defaultPermissions: gridFSConfig.defaultPermissions || 'private',
      enableAuditLogging: gridFSConfig.enableAuditLogging || false,

      // Backup and replication
      enableBackupIntegration: gridFSConfig.enableBackupIntegration || false,
      enableReplicationMonitoring: gridFSConfig.enableReplicationMonitoring || false,

      // File processing
      enableContentAnalysis: gridFSConfig.enableContentAnalysis || false,
      enableVirusScan: gridFSConfig.enableVirusScan || false,
      enableFormatValidation: gridFSConfig.enableFormatValidation || false
    };

    // File management state
    this.activeUploads = new Map();
    this.activeDownloads = new Map();
    this.fileOperations = new Map();
    this.uploadQueue = [];

    // Performance metrics
    this.metrics = {
      totalFilesStored: 0,
      totalBytesStored: 0,
      averageUploadSpeed: 0,
      averageDownloadSpeed: 0,
      storageEfficiency: 0
    };

    this.initializeGridFS();
  }

  async initializeGridFS() {
    console.log('Initializing advanced GridFS file management...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.connectionString);
      await this.client.connect();
      this.db = this.client.db();

      // Initialize default GridFS bucket
      this.initializeBucket(this.config.defaultBucket);

      // Initialize custom buckets
      for (const [bucketName, bucketConfig] of Object.entries(this.config.customBuckets)) {
        this.initializeBucket(bucketName, bucketConfig);
      }

      // Setup metadata indexing
      if (this.config.enableMetadataIndexing) {
        await this.setupMetadataIndexing();
      }

      // Setup file processing pipeline
      await this.setupFileProcessingPipeline();

      // Initialize monitoring and metrics
      await this.setupMonitoringAndMetrics();

      console.log('Advanced GridFS file management initialized successfully');

    } catch (error) {
      console.error('Error initializing GridFS:', error);
      throw error;
    }
  }

  initializeBucket(bucketName, bucketConfig = {}) {
    const bucket = new GridFSBucket(this.db, {
      bucketName: bucketName,
      chunkSizeBytes: bucketConfig.chunkSizeBytes || this.config.chunkSizeBytes
    });

    this.gridFSBuckets.set(bucketName, {
      bucket: bucket,
      config: bucketConfig,
      stats: {
        totalFiles: 0,
        totalBytes: 0,
        averageFileSize: 0,
        lastActivity: new Date()
      }
    });

    console.log(`Initialized GridFS bucket: ${bucketName}`);
  }

  async setupMetadataIndexing() {
    console.log('Setting up metadata indexing for GridFS...');

    try {
      // Create indexes on files collection for efficient queries
      for (const [bucketName, bucketInfo] of this.gridFSBuckets.entries()) {
        const filesCollection = this.db.collection(`${bucketName}.files`);
        const chunksCollection = this.db.collection(`${bucketName}.chunks`);

        // Files collection indexes
        await filesCollection.createIndex(
          { filename: 1, uploadDate: -1 },
          { background: true }
        );

        await filesCollection.createIndex(
          { 'metadata.contentType': 1, uploadDate: -1 },
          { background: true }
        );

        await filesCollection.createIndex(
          { 'metadata.tags': 1 },
          { background: true }
        );

        await filesCollection.createIndex(
          { length: -1, uploadDate: -1 },
          { background: true }
        );

        // Chunks collection optimization
        await chunksCollection.createIndex(
          { files_id: 1, n: 1 },
          { unique: true, background: true }
        );
      }

    } catch (error) {
      console.error('Error setting up metadata indexing:', error);
      throw error;
    }
  }

  async uploadFile(filePath, options = {}) {
    console.log(`Starting file upload: ${filePath}`);

    const uploadId = this.generateUploadId();
    const startTime = Date.now();

    try {
      // Validate file exists and get stats
      const fileStats = await fs.promises.stat(filePath);

      // Prepare upload configuration
      const uploadConfig = {
        uploadId: uploadId,
        startTime: startTime,
        filePath: filePath,
        fileName: options.filename || path.basename(filePath),
        bucketName: options.bucket || this.config.defaultBucket,
        contentType: options.contentType || this.detectContentType(filePath),

        // File metadata
        metadata: {
          originalPath: filePath,
          fileSize: fileStats.size,
          uploadedBy: options.uploadedBy || 'system',
          uploadDate: new Date(),
          contentType: options.contentType || this.detectContentType(filePath),

          // Custom metadata
          tags: options.tags || [],
          category: options.category || 'general',
          permissions: options.permissions || this.config.defaultPermissions,

          // Processing configuration
          processOnUpload: options.processOnUpload || false,
          generateThumbnail: options.generateThumbnail || false,
          enableCompression: options.enableCompression || this.config.enableCompression,

          // Checksums for integrity
          checksums: {}
        },

        // Upload progress tracking
        progress: {
          bytesUploaded: 0,
          totalBytes: fileStats.size,
          percentComplete: 0,
          uploadSpeed: 0,
          estimatedTimeRemaining: 0
        }
      };

      // Get GridFS bucket
      const bucketInfo = this.gridFSBuckets.get(uploadConfig.bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket not found: ${uploadConfig.bucketName}`);
      }

      // Store upload state
      this.activeUploads.set(uploadId, uploadConfig);

      // Calculate file checksum before upload
      if (this.config.enableDeduplication) {
        uploadConfig.metadata.checksums.md5 = await this.calculateFileChecksum(filePath, 'md5');
        uploadConfig.metadata.checksums.sha256 = await this.calculateFileChecksum(filePath, 'sha256');

        // Check for duplicate files
        const duplicate = await this.findDuplicateFile(uploadConfig.metadata.checksums.sha256, uploadConfig.bucketName);
        if (duplicate && options.skipDuplicates) {
          this.emit('duplicateDetected', {
            uploadId: uploadId,
            duplicateFileId: duplicate._id,
            fileName: uploadConfig.fileName
          });

          return {
            success: true,
            uploadId: uploadId,
            fileId: duplicate._id,
            isDuplicate: true,
            fileName: uploadConfig.fileName,
            fileSize: duplicate.length
          };
        }
      }

      // Create upload stream
      const uploadStream = bucketInfo.bucket.openUploadStream(uploadConfig.fileName, {
        chunkSizeBytes: bucketInfo.config.chunkSizeBytes || this.config.chunkSizeBytes,
        metadata: uploadConfig.metadata
      });

      // Create read stream from file
      const fileReadStream = createReadStream(filePath);

      // Track upload progress
      const progressTracker = this.createProgressTracker(uploadId, uploadConfig);

      // Pipeline streams with error handling
      const pipelineAsync = promisify(pipeline);

      await pipelineAsync(
        fileReadStream,
        progressTracker,
        uploadStream
      );

      // Update upload completion
      const endTime = Date.now();
      const duration = endTime - startTime;
      const fileId = uploadStream.id;

      uploadConfig.fileId = fileId;
      uploadConfig.status = 'completed';
      uploadConfig.duration = duration;
      uploadConfig.uploadSpeed = (fileStats.size / 1024 / 1024) / (duration / 1000); // MB/s

      // Update bucket statistics
      bucketInfo.stats.totalFiles++;
      bucketInfo.stats.totalBytes += fileStats.size;
      bucketInfo.stats.averageFileSize = bucketInfo.stats.totalBytes / bucketInfo.stats.totalFiles;
      bucketInfo.stats.lastActivity = new Date();

      // Post-processing
      if (uploadConfig.metadata.processOnUpload) {
        await this.processUploadedFile(fileId, uploadConfig);
      }

      // Update system metrics
      this.updateMetrics(uploadConfig);

      // Cleanup
      this.activeUploads.delete(uploadId);

      this.emit('uploadCompleted', {
        uploadId: uploadId,
        fileId: fileId,
        fileName: uploadConfig.fileName,
        fileSize: fileStats.size,
        duration: duration,
        uploadSpeed: uploadConfig.uploadSpeed
      });

      console.log(`File upload completed: ${uploadConfig.fileName} (${fileId})`);

      return {
        success: true,
        uploadId: uploadId,
        fileId: fileId,
        fileName: uploadConfig.fileName,
        fileSize: fileStats.size,
        duration: duration,
        uploadSpeed: uploadConfig.uploadSpeed,
        bucketName: uploadConfig.bucketName
      };

    } catch (error) {
      console.error(`File upload failed for ${uploadId}:`, error);

      // Update upload state
      const uploadConfig = this.activeUploads.get(uploadId);
      if (uploadConfig) {
        uploadConfig.status = 'failed';
        uploadConfig.error = error.message;
      }

      this.emit('uploadFailed', {
        uploadId: uploadId,
        fileName: options.filename || path.basename(filePath),
        error: error.message
      });

      return {
        success: false,
        uploadId: uploadId,
        error: error.message
      };
    }
  }

  createProgressTracker(uploadId, uploadConfig) {
    const { Transform } = require('stream');

    return new Transform({
      // Arrow function keeps `this` bound to the file manager so progress events
      // are emitted from the manager rather than the stream
      transform: (chunk, encoding, callback) => {
        // Update progress
        uploadConfig.progress.bytesUploaded += chunk.length;
        uploadConfig.progress.percentComplete =
          (uploadConfig.progress.bytesUploaded / uploadConfig.progress.totalBytes) * 100;

        // Calculate upload speed (MB/s) from elapsed time since the upload started
        const timeElapsed = Math.max((Date.now() - uploadConfig.startTime) / 1000, 0.001); // seconds
        uploadConfig.progress.uploadSpeed =
          (uploadConfig.progress.bytesUploaded / 1024 / 1024) / timeElapsed;

        // Estimate time remaining (seconds)
        const remainingBytes = uploadConfig.progress.totalBytes - uploadConfig.progress.bytesUploaded;
        uploadConfig.progress.estimatedTimeRemaining =
          remainingBytes / Math.max(uploadConfig.progress.uploadSpeed * 1024 * 1024, 1);

        // Emit progress update
        this.emit('uploadProgress', {
          uploadId: uploadId,
          progress: uploadConfig.progress
        });

        callback(null, chunk);
      }
    });
  }

  async downloadFile(fileId, downloadPath, options = {}) {
    console.log(`Starting file download: ${fileId} -> ${downloadPath}`);

    const downloadId = this.generateDownloadId();
    const startTime = Date.now();

    try {
      // Get bucket
      const bucketName = options.bucket || this.config.defaultBucket;
      const bucketInfo = this.gridFSBuckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket not found: ${bucketName}`);
      }

      // Get file metadata
      const fileMetadata = await this.getFileMetadata(fileId, bucketName);
      if (!fileMetadata) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Prepare download configuration
      const downloadConfig = {
        downloadId: downloadId,
        startTime: startTime,
        fileId: fileId,
        downloadPath: downloadPath,
        bucketName: bucketName,
        fileSize: fileMetadata.length,

        // Download progress tracking
        progress: {
          bytesDownloaded: 0,
          totalBytes: fileMetadata.length,
          percentComplete: 0,
          downloadSpeed: 0,
          estimatedTimeRemaining: 0
        }
      };

      // Store download state
      this.activeDownloads.set(downloadId, downloadConfig);

      // Create download stream
      const downloadStream = bucketInfo.bucket.openDownloadStream(fileId);

      // Create write stream to file
      const fileWriteStream = createWriteStream(downloadPath);

      // Track download progress
      const progressTracker = this.createDownloadProgressTracker(downloadId, downloadConfig);

      // Pipeline streams
      const pipelineAsync = promisify(pipeline);

      await pipelineAsync(
        downloadStream,
        progressTracker,
        fileWriteStream
      );

      // Update download completion
      const endTime = Date.now();
      const duration = endTime - startTime;
      downloadConfig.duration = duration;
      downloadConfig.downloadSpeed = (fileMetadata.length / 1024 / 1024) / (duration / 1000); // MB/s

      // Cleanup
      this.activeDownloads.delete(downloadId);

      this.emit('downloadCompleted', {
        downloadId: downloadId,
        fileId: fileId,
        fileName: fileMetadata.filename,
        fileSize: fileMetadata.length,
        duration: duration,
        downloadSpeed: downloadConfig.downloadSpeed
      });

      console.log(`File download completed: ${fileMetadata.filename} (${fileId})`);

      return {
        success: true,
        downloadId: downloadId,
        fileId: fileId,
        fileName: fileMetadata.filename,
        fileSize: fileMetadata.length,
        duration: duration,
        downloadSpeed: downloadConfig.downloadSpeed
      };

    } catch (error) {
      console.error(`File download failed for ${downloadId}:`, error);

      // Cleanup
      this.activeDownloads.delete(downloadId);

      this.emit('downloadFailed', {
        downloadId: downloadId,
        fileId: fileId,
        error: error.message
      });

      return {
        success: false,
        downloadId: downloadId,
        fileId: fileId,
        error: error.message
      };
    }
  }

  createDownloadProgressTracker(downloadId, downloadConfig) {
    const { Transform } = require('stream');

    // Track elapsed time locally; the download configuration does not record a start time
    const startTime = Date.now();
    const manager = this;

    return new Transform({
      transform(chunk, encoding, callback) {
        // Update progress
        downloadConfig.progress.bytesDownloaded += chunk.length;
        downloadConfig.progress.percentComplete = 
          (downloadConfig.progress.bytesDownloaded / downloadConfig.progress.totalBytes) * 100;

        // Calculate download speed (guard against a zero elapsed time on the first chunk)
        const timeElapsed = Math.max((Date.now() - startTime) / 1000, 0.001); // seconds
        downloadConfig.progress.downloadSpeed = 
          (downloadConfig.progress.bytesDownloaded / 1024 / 1024) / timeElapsed; // MB/s

        // Estimate time remaining (seconds)
        const remainingBytes = downloadConfig.progress.totalBytes - downloadConfig.progress.bytesDownloaded;
        downloadConfig.progress.estimatedTimeRemaining = downloadConfig.progress.downloadSpeed > 0
          ? remainingBytes / (downloadConfig.progress.downloadSpeed * 1024 * 1024)
          : 0;

        // Emit progress on the file manager; inside transform(), `this` is the stream itself
        manager.emit('downloadProgress', {
          downloadId: downloadId,
          progress: downloadConfig.progress
        });

        callback(null, chunk);
      }
    });
  }

  async getFileMetadata(fileId, bucketName = null) {
    console.log(`Getting file metadata: ${fileId}`);

    try {
      bucketName = bucketName || this.config.defaultBucket;
      const filesCollection = this.db.collection(`${bucketName}.files`);

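      // GridFS assigns ObjectId values to _id by default; if fileId arrives as a string,
      // it may need to be converted (e.g. new ObjectId(fileId)) before this lookup matches.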
      const fileMetadata = await filesCollection.findOne({ _id: fileId });
      return fileMetadata;

    } catch (error) {
      console.error(`Error getting file metadata for ${fileId}:`, error);
      throw error;
    }
  }

  async searchFiles(searchCriteria, options = {}) {
    console.log('Searching files with criteria:', searchCriteria);

    try {
      const bucketName = options.bucket || this.config.defaultBucket;
      const filesCollection = this.db.collection(`${bucketName}.files`);

      // Build search query
      const query = {};

      // Text search on filename
      if (searchCriteria.filename) {
        query.filename = { $regex: searchCriteria.filename, $options: 'i' };
      }

      // Content type filter
      if (searchCriteria.contentType) {
        query['metadata.contentType'] = searchCriteria.contentType;
      }

      // Size range filter
      if (searchCriteria.sizeRange) {
        query.length = {};
        if (searchCriteria.sizeRange.min) {
          query.length.$gte = searchCriteria.sizeRange.min;
        }
        if (searchCriteria.sizeRange.max) {
          query.length.$lte = searchCriteria.sizeRange.max;
        }
      }

      // Date range filter
      if (searchCriteria.dateRange) {
        query.uploadDate = {};
        if (searchCriteria.dateRange.from) {
          query.uploadDate.$gte = new Date(searchCriteria.dateRange.from);
        }
        if (searchCriteria.dateRange.to) {
          query.uploadDate.$lte = new Date(searchCriteria.dateRange.to);
        }
      }

      // Tags filter
      if (searchCriteria.tags) {
        query['metadata.tags'] = { $in: searchCriteria.tags };
      }

      // Category filter
      if (searchCriteria.category) {
        query['metadata.category'] = searchCriteria.category;
      }

      // Execute search with pagination
      const limit = options.limit || 50;
      const skip = options.skip || 0;
      const sort = options.sort || { uploadDate: -1 };

      const files = await filesCollection
        .find(query)
        .sort(sort)
        .limit(limit)
        .skip(skip)
        .toArray();

      // Get total count for pagination
      const totalCount = await filesCollection.countDocuments(query);

      return {
        success: true,
        files: files.map(file => ({
          fileId: file._id,
          filename: file.filename,
          length: file.length,
          uploadDate: file.uploadDate,
          contentType: file.metadata?.contentType,
          tags: file.metadata?.tags || [],
          category: file.metadata?.category,
          checksums: file.metadata?.checksums || {}
        })),
        totalCount: totalCount,
        currentPage: Math.floor(skip / limit) + 1,
        totalPages: Math.ceil(totalCount / limit)
      };

    } catch (error) {
      console.error('Error searching files:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async deleteFile(fileId, bucketName = null) {
    console.log(`Deleting file: ${fileId}`);

    try {
      bucketName = bucketName || this.config.defaultBucket;
      const bucketInfo = this.gridFSBuckets.get(bucketName);
      if (!bucketInfo) {
        throw new Error(`GridFS bucket not found: ${bucketName}`);
      }

      // Get file metadata before deletion
      const fileMetadata = await this.getFileMetadata(fileId, bucketName);
      if (!fileMetadata) {
        throw new Error(`File not found: ${fileId}`);
      }

      // Delete file from GridFS
      await bucketInfo.bucket.delete(fileId);

      // Update bucket statistics
      bucketInfo.stats.totalFiles = Math.max(0, bucketInfo.stats.totalFiles - 1);
      bucketInfo.stats.totalBytes = Math.max(0, bucketInfo.stats.totalBytes - fileMetadata.length);
      if (bucketInfo.stats.totalFiles > 0) {
        bucketInfo.stats.averageFileSize = bucketInfo.stats.totalBytes / bucketInfo.stats.totalFiles;
      }
      bucketInfo.stats.lastActivity = new Date();

      this.emit('fileDeleted', {
        fileId: fileId,
        fileName: fileMetadata.filename,
        fileSize: fileMetadata.length,
        bucketName: bucketName
      });

      console.log(`File deleted successfully: ${fileMetadata.filename} (${fileId})`);

      return {
        success: true,
        fileId: fileId,
        fileName: fileMetadata.filename,
        fileSize: fileMetadata.length
      };

    } catch (error) {
      console.error(`Error deleting file ${fileId}:`, error);
      return {
        success: false,
        fileId: fileId,
        error: error.message
      };
    }
  }

  async getStorageStatistics(bucketName = null) {
    console.log(`Getting storage statistics${bucketName ? ' for bucket: ' + bucketName : ''}`);

    try {
      const statistics = {};

      if (bucketName) {
        // Get statistics for specific bucket
        const bucketInfo = this.gridFSBuckets.get(bucketName);
        if (!bucketInfo) {
          throw new Error(`GridFS bucket not found: ${bucketName}`);
        }

        statistics[bucketName] = await this.calculateBucketStatistics(bucketName, bucketInfo);
      } else {
        // Get statistics for all buckets
        for (const [name, bucketInfo] of this.gridFSBuckets.entries()) {
          statistics[name] = await this.calculateBucketStatistics(name, bucketInfo);
        }
      }

      // Calculate system-wide statistics
      const systemStats = {
        totalBuckets: this.gridFSBuckets.size,
        totalFiles: Object.values(statistics).reduce((sum, bucket) => sum + bucket.fileCount, 0),
        totalBytes: Object.values(statistics).reduce((sum, bucket) => sum + bucket.totalBytes, 0),
        averageFileSize: 0,
        storageEfficiency: this.metrics.storageEfficiency
      };

      if (systemStats.totalFiles > 0) {
        systemStats.averageFileSize = systemStats.totalBytes / systemStats.totalFiles;
      }

      return {
        success: true,
        bucketStatistics: statistics,
        systemStatistics: systemStats,
        retrievalTime: new Date()
      };

    } catch (error) {
      console.error('Error getting storage statistics:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async calculateBucketStatistics(bucketName, bucketInfo) {
    const filesCollection = this.db.collection(`${bucketName}.files`);
    const chunksCollection = this.db.collection(`${bucketName}.chunks`);

    // Basic file statistics
    const fileStats = await filesCollection.aggregate([
      {
        $group: {
          _id: null,
          fileCount: { $sum: 1 },
          totalBytes: { $sum: '$length' },
          averageFileSize: { $avg: '$length' },
          largestFile: { $max: '$length' },
          smallestFile: { $min: '$length' }
        }
      }
    ]).toArray();

    // Content type distribution
    const contentTypeStats = await filesCollection.aggregate([
      {
        $group: {
          _id: '$metadata.contentType',
          count: { $sum: 1 },
          totalBytes: { $sum: '$length' }
        }
      },
      { $sort: { count: -1 } },
      { $limit: 10 }
    ]).toArray();

    // Chunk statistics
    const chunkStats = await chunksCollection.aggregate([
      {
        $group: {
          _id: null,
          totalChunks: { $sum: 1 },
          averageChunkSize: { $avg: { $binarySize: '$data' } }
        }
      }
    ]).toArray();

    const baseStats = fileStats[0] || {
      fileCount: 0,
      totalBytes: 0,
      averageFileSize: 0,
      largestFile: 0,
      smallestFile: 0
    };

    return {
      fileCount: baseStats.fileCount,
      totalBytes: baseStats.totalBytes,
      averageFileSize: Math.round(baseStats.averageFileSize || 0),
      largestFile: baseStats.largestFile,
      smallestFile: baseStats.smallestFile,
      contentTypes: contentTypeStats,
      totalChunks: chunkStats[0]?.totalChunks || 0,
      averageChunkSize: Math.round(chunkStats[0]?.averageChunkSize || 0),
      storageEfficiency: this.calculateStorageEfficiency(bucketName)
    };
  }

  // Utility methods

  generateUploadId() {
    return `upload_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  generateDownloadId() {
    return `download_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }

  detectContentType(filePath) {
    const path = require('path');
    const ext = path.extname(filePath).toLowerCase();

    const mimeTypes = {
      '.pdf': 'application/pdf',
      '.doc': 'application/msword',
      '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
      '.jpg': 'image/jpeg',
      '.jpeg': 'image/jpeg',
      '.png': 'image/png',
      '.gif': 'image/gif',
      '.mp4': 'video/mp4',
      '.mp3': 'audio/mpeg',
      '.zip': 'application/zip',
      '.txt': 'text/plain',
      '.json': 'application/json',
      '.xml': 'application/xml'
    };

    return mimeTypes[ext] || 'application/octet-stream';
  }

  async calculateFileChecksum(filePath, algorithm = 'sha256') {
    return new Promise((resolve, reject) => {
      const hash = crypto.createHash(algorithm);
      const stream = createReadStream(filePath);

      stream.on('data', (data) => {
        hash.update(data);
      });

      stream.on('end', () => {
        resolve(hash.digest('hex'));
      });

      stream.on('error', (error) => {
        reject(error);
      });
    });
  }

  async findDuplicateFile(checksum, bucketName) {
    const filesCollection = this.db.collection(`${bucketName}.files`);
    return await filesCollection.findOne({
      'metadata.checksums.sha256': checksum
    });
  }

  calculateStorageEfficiency(bucketName) {
    // Simplified storage efficiency calculation
    // In a real implementation, this would analyze compression ratios, deduplication, etc.
    return 85.0; // Placeholder
  }
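
  // A possible refinement of the placeholder above (a sketch, not part of the original
  // class): estimate efficiency as the ratio of logical file bytes recorded in
  // <bucket>.files to the physical chunk bytes stored in <bucket>.chunks.
  // The method name estimateStorageEfficiency is illustrative.
  async estimateStorageEfficiency(bucketName) {
    const filesCollection = this.db.collection(`${bucketName}.files`);
    const chunksCollection = this.db.collection(`${bucketName}.chunks`);

    const [fileTotals] = await filesCollection.aggregate([
      { $group: { _id: null, logicalBytes: { $sum: '$length' } } }
    ]).toArray();

    const [chunkTotals] = await chunksCollection.aggregate([
      { $group: { _id: null, storedBytes: { $sum: { $binarySize: '$data' } } } }
    ]).toArray();

    const logicalBytes = fileTotals?.logicalBytes || 0;
    const storedBytes = chunkTotals?.storedBytes || 0;

    // Values above 100 indicate chunks stored smaller than their logical size
    // (for example, when the application compresses chunks before writing them)
    return storedBytes > 0 ? (logicalBytes / storedBytes) * 100 : 0;
  }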

  updateMetrics(uploadConfig) {
    this.metrics.totalFilesStored++;
    this.metrics.totalBytesStored += uploadConfig.progress.totalBytes;

    // Update average upload speed
    const totalUploads = this.metrics.totalFilesStored;
    this.metrics.averageUploadSpeed = 
      ((this.metrics.averageUploadSpeed * (totalUploads - 1)) + uploadConfig.uploadSpeed) / totalUploads;
  }

  async setupFileProcessingPipeline() {
    // Setup file processing pipeline for thumbnails, content analysis, etc.
    console.log('Setting up file processing pipeline...');
  }

  async setupMonitoringAndMetrics() {
    // Setup monitoring and metrics collection
    console.log('Setting up monitoring and metrics...');
  }

  async processUploadedFile(fileId, uploadConfig) {
    // Process uploaded file (thumbnails, analysis, etc.)
    console.log(`Processing uploaded file: ${fileId}`);
  }

  async shutdown() {
    console.log('Shutting down GridFS file manager...');

    try {
      // Wait for active uploads/downloads to complete
      if (this.activeUploads.size > 0) {
        console.log(`Waiting for ${this.activeUploads.size} uploads to complete...`);
      }

      if (this.activeDownloads.size > 0) {
        console.log(`Waiting for ${this.activeDownloads.size} downloads to complete...`);
      }

      // Close MongoDB connection
      if (this.client) {
        await this.client.close();
      }

      console.log('GridFS file manager shutdown complete');

    } catch (error) {
      console.error('Error during shutdown:', error);
    }
  }
}

// Benefits of MongoDB GridFS Advanced File Storage:
// - Efficient binary data storage with automatic chunking and compression
// - Integrated metadata management with full-text search capabilities
// - Streaming upload and download with progress tracking and optimization
// - Built-in replication and distributed storage through MongoDB replica sets
// - Transactional consistency between file operations and database operations
// - Advanced file processing pipeline with thumbnail generation and content analysis
// - Comprehensive version control and access management capabilities
// - SQL-compatible file operations through QueryLeaf integration
// - Enterprise-grade security, encryption, and audit logging
// - Production-ready scalability with automatic load balancing and optimization

module.exports = {
  AdvancedGridFSFileManager
};
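
For reference, here is a minimal usage sketch of the manager defined above. It is illustrative rather than definitive: it assumes the class completes its connection and bucket setup in an async initialization step (called initialize() here purely for illustration), and that the configuration accepts a defaultBucket field as used internally by the class. Only methods and events shown earlier (searchFiles, downloadFile, getStorageStatistics, shutdown, and the downloadProgress event) are exercised.

// Hypothetical usage sketch of AdvancedGridFSFileManager (paths and names are illustrative)
const { AdvancedGridFSFileManager } = require('./advanced-gridfs-file-manager');

async function main() {
  const fileManager = new AdvancedGridFSFileManager('mongodb://localhost:27017/media', {
    defaultBucket: 'media_files'
  });

  // Assumed async setup step; the real class may instead connect in its constructor
  if (typeof fileManager.initialize === 'function') {
    await fileManager.initialize();
  }

  fileManager.on('downloadProgress', ({ downloadId, progress }) => {
    console.log(`Download ${downloadId}: ${progress.percentComplete.toFixed(1)}% complete`);
  });

  // Find recently uploaded PDFs, then download the first match
  const results = await fileManager.searchFiles(
    { contentType: 'application/pdf', dateRange: { from: '2024-01-01' } },
    { limit: 5 }
  );

  if (results.success && results.files.length > 0) {
    await fileManager.downloadFile(results.files[0].fileId, './downloads/first-match.pdf');
  }

  const stats = await fileManager.getStorageStatistics();
  console.log('System statistics:', stats.systemStatistics);

  await fileManager.shutdown();
}

main().catch(console.error);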

Understanding MongoDB GridFS Architecture

Advanced File Storage Design and Implementation Patterns

Implement comprehensive GridFS workflows for enterprise file management:

// Enterprise-grade GridFS with advanced distributed file management capabilities
class EnterpriseGridFSManager extends AdvancedGridFSFileManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableDistributedProcessing: true,
      enableContentDeliveryNetwork: true,
      enableAdvancedSecurity: true,
      enableComplianceAuditing: true,
      enableGlobalReplication: true
    };

    this.setupEnterpriseCapabilities();
    this.initializeDistributedProcessing();
    this.setupContentDeliveryNetwork();
  }

  async implementAdvancedFileStrategy() {
    console.log('Implementing enterprise file management strategy...');

    const fileStrategy = {
      // Multi-tier storage strategy
      storageTiers: {
        hotStorage: {
          criteria: 'accessed_within_30_days',
          chunkSize: 261120,
          compressionLevel: 6,
          replicationFactor: 3
        },
        coldStorage: {
          criteria: 'accessed_30_to_90_days_ago',
          chunkSize: 1048576,
          compressionLevel: 9,
          replicationFactor: 2
        },
        archiveStorage: {
          criteria: 'accessed_more_than_90_days_ago',
          chunkSize: 4194304,
          compressionLevel: 9,
          replicationFactor: 1
        }
      },

      // Content delivery optimization
      contentDelivery: {
        enableGlobalDistribution: true,
        enableEdgeCaching: true,
        enableImageOptimization: true,
        enableVideoTranscoding: true
      },

      // Advanced processing
      fileProcessing: {
        enableMachineLearning: true,
        enableContentRecognition: true,
        enableAutomaticTagging: true,
        enableThreatDetection: true
      }
    };

    return await this.deployEnterpriseStrategy(fileStrategy);
  }

  async setupAdvancedSecurity() {
    console.log('Setting up enterprise security for file operations...');

    const securityConfig = {
      // File encryption
      encryptionAtRest: true,
      encryptionInTransit: true,
      encryptionKeyRotation: true,

      // Access control
      roleBasedAccess: true,
      attributeBasedAccess: true,
      dynamicPermissions: true,

      // Threat protection
      malwareScanning: true,
      contentFiltering: true,
      dataLossPrevention: true
    };

    return await this.deploySecurityFramework(securityConfig);
  }
}

SQL-Style GridFS Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB GridFS operations:

-- QueryLeaf advanced GridFS operations with SQL-familiar syntax for MongoDB

-- Configure GridFS bucket with comprehensive settings
CREATE GRIDFS_BUCKET media_files 
WITH chunk_size_bytes = 261120,
     enable_compression = true,
     compression_level = 6,
     enable_encryption = true,
     enable_metadata_indexing = true,
     enable_version_control = true,
     enable_thumbnail_generation = true,
     enable_content_analysis = true,

     -- Storage optimization
     enable_deduplication = true,
     enable_automatic_cleanup = true,
     storage_tier_management = true,

     -- Access control
     default_permissions = 'private',
     enable_access_logging = true,
     enable_audit_trail = true,

     -- Performance settings
     max_concurrent_uploads = 10,
     enable_parallel_processing = true,
     enable_streaming_optimization = true,

     -- Backup and replication
     enable_backup_integration = true,
     cross_region_replication = true,
     replication_factor = 3;

-- Advanced file upload with comprehensive metadata and processing
WITH file_uploads AS (
  SELECT 
    file_id,
    filename,
    file_size_bytes,
    content_type,
    upload_timestamp,
    upload_duration_seconds,
    upload_speed_mbps,

    -- Processing results
    compression_applied,
    compression_ratio,
    thumbnail_generated,
    content_analysis_completed,
    virus_scan_status,

    -- Metadata extraction
    JSON_EXTRACT(metadata, '$.originalPath') as original_path,
    JSON_EXTRACT(metadata, '$.uploadedBy') as uploaded_by,
    JSON_EXTRACT(metadata, '$.tags') as file_tags,
    JSON_EXTRACT(metadata, '$.category') as file_category,
    JSON_EXTRACT(metadata, '$.permissions') as access_permissions,

    -- File integrity
    JSON_EXTRACT(metadata, '$.checksums.md5') as md5_checksum,
    JSON_EXTRACT(metadata, '$.checksums.sha256') as sha256_checksum,

    -- Processing pipeline results
    JSON_EXTRACT(metadata, '$.processingResults') as processing_results,
    JSON_EXTRACT(metadata, '$.thumbnailPath') as thumbnail_path,
    JSON_EXTRACT(metadata, '$.contentAnalysis') as content_analysis

  FROM GRIDFS_FILES('media_files')
  WHERE upload_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),

upload_performance AS (
  SELECT 
    file_category,
    content_type,
    COUNT(*) as total_uploads,
    SUM(file_size_bytes) as total_bytes_uploaded,
    AVG(upload_duration_seconds) as avg_upload_time,
    AVG(upload_speed_mbps) as avg_upload_speed,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY upload_duration_seconds) as p95_upload_time,

    -- Processing performance
    COUNT(*) FILTER (WHERE compression_applied = true) as compressed_files,
    AVG(compression_ratio) FILTER (WHERE compression_applied = true) as avg_compression_ratio,
    COUNT(*) FILTER (WHERE thumbnail_generated = true) as thumbnails_generated,
    COUNT(*) FILTER (WHERE content_analysis_completed = true) as content_analyzed,
    COUNT(*) FILTER (WHERE virus_scan_status = 'clean') as clean_files,
    COUNT(*) FILTER (WHERE virus_scan_status = 'threat_detected') as threat_files,

    -- File size distribution
    AVG(file_size_bytes) as avg_file_size_bytes,
    MAX(file_size_bytes) as largest_file_bytes,
    MIN(file_size_bytes) as smallest_file_bytes,

    -- Storage efficiency
    SUM(file_size_bytes) as original_total_bytes,
    SUM(CASE WHEN compression_applied THEN 
          file_size_bytes * (1 - compression_ratio) 
        ELSE file_size_bytes 
    END) as stored_total_bytes

  FROM file_uploads
  GROUP BY file_category, content_type
),

storage_analysis AS (
  SELECT 
    DATE_TRUNC('hour', upload_timestamp) as upload_hour,

    -- Upload volume analysis
    COUNT(*) as files_uploaded,
    SUM(file_size_bytes) as bytes_uploaded,
    AVG(upload_speed_mbps) as avg_hourly_upload_speed,

    -- Content type distribution
    COUNT(*) FILTER (WHERE content_type LIKE 'image/%') as image_files,
    COUNT(*) FILTER (WHERE content_type LIKE 'video/%') as video_files,
    COUNT(*) FILTER (WHERE content_type LIKE 'audio/%') as audio_files,
    COUNT(*) FILTER (WHERE content_type = 'application/pdf') as pdf_files,
    COUNT(*) FILTER (WHERE content_type NOT LIKE 'image/%' 
                       AND content_type NOT LIKE 'video/%' 
                       AND content_type NOT LIKE 'audio/%' 
                       AND content_type <> 'application/pdf') as other_files,

    -- Processing success rates
    COUNT(*) FILTER (WHERE virus_scan_status = 'clean') as safe_files,
    COUNT(*) FILTER (WHERE content_analysis_completed = true) as analyzed_files,
    COUNT(*) FILTER (WHERE thumbnail_generated = true) as thumbnail_files,

    -- Storage optimization metrics
    AVG(CASE WHEN compression_applied THEN compression_ratio ELSE 0 END) as avg_compression_ratio,
    SUM(CASE WHEN compression_applied THEN 
          file_size_bytes * compression_ratio 
        ELSE 0 
    END) as total_space_saved_bytes

  FROM file_uploads
  GROUP BY DATE_TRUNC('hour', upload_timestamp)
)

SELECT 
  up.file_category,
  up.content_type,
  up.total_uploads,

  -- Upload performance metrics
  ROUND(up.total_bytes_uploaded / 1024.0 / 1024.0, 2) as total_uploaded_mb,
  ROUND(up.avg_upload_time, 2) as avg_upload_time_seconds,
  ROUND(up.avg_upload_speed, 2) as avg_upload_speed_mbps,
  ROUND(up.p95_upload_time, 2) as p95_upload_time_seconds,

  -- Processing efficiency
  up.compressed_files,
  ROUND((up.compressed_files * 100.0) / up.total_uploads, 1) as compression_rate_percent,
  ROUND(up.avg_compression_ratio * 100, 1) as avg_compression_percent,
  up.thumbnails_generated,
  ROUND((up.thumbnails_generated * 100.0) / up.total_uploads, 1) as thumbnail_rate_percent,

  -- Content analysis results
  up.content_analyzed,
  ROUND((up.content_analyzed * 100.0) / up.total_uploads, 1) as analysis_rate_percent,

  -- Security metrics
  up.clean_files,
  up.threat_files,
  CASE 
    WHEN up.threat_files > 0 THEN 'security_issues_detected'
    ELSE 'all_files_clean'
  END as security_status,

  -- File size statistics
  ROUND(up.avg_file_size_bytes / 1024.0 / 1024.0, 2) as avg_file_size_mb,
  ROUND(up.largest_file_bytes / 1024.0 / 1024.0, 2) as largest_file_mb,
  ROUND(up.smallest_file_bytes / 1024.0, 2) as smallest_file_kb,

  -- Storage optimization
  ROUND(up.original_total_bytes / 1024.0 / 1024.0, 2) as original_storage_mb,
  ROUND(up.stored_total_bytes / 1024.0 / 1024.0, 2) as actual_storage_mb,
  ROUND(((up.original_total_bytes - up.stored_total_bytes) / up.original_total_bytes) * 100, 1) as storage_savings_percent,

  -- Performance assessment
  CASE 
    WHEN up.avg_upload_speed > 50 THEN 'excellent'
    WHEN up.avg_upload_speed > 20 THEN 'good'
    WHEN up.avg_upload_speed > 10 THEN 'acceptable'
    ELSE 'needs_optimization'
  END as upload_performance_rating,

  -- Processing health
  CASE 
    WHEN up.threat_files > 0 THEN 'security_review_required'
    WHEN (up.thumbnails_generated * 100.0 / up.total_uploads) < 80 AND up.content_type LIKE 'image/%' THEN 'thumbnail_generation_issues'
    WHEN (up.content_analyzed * 100.0 / up.total_uploads) < 90 THEN 'content_analysis_issues'
    ELSE 'processing_healthy'
  END as processing_health_status,

  -- Optimization recommendations
  ARRAY[
    CASE WHEN up.avg_upload_speed < 10 THEN 'Optimize network bandwidth or chunk size' END,
    CASE WHEN up.avg_compression_ratio < 0.3 AND up.content_type LIKE 'image/%' THEN 'Review image compression settings' END,
    CASE WHEN (up.thumbnails_generated * 100.0 / up.total_uploads) < 50 AND up.content_type LIKE 'image/%' THEN 'Fix thumbnail generation pipeline' END,
    CASE WHEN up.threat_files > 0 THEN 'Review security scanning configuration' END,
    CASE WHEN up.p95_upload_time > 300 THEN 'Optimize upload processing for large files' END
  ]::TEXT[] as optimization_recommendations

FROM upload_performance up
ORDER BY up.total_bytes_uploaded DESC, up.total_uploads DESC;

-- Advanced file search and retrieval with comprehensive filtering
WITH file_search_results AS (
  SELECT 
    file_id,
    filename,
    content_type,
    file_size_bytes,
    upload_timestamp,

    -- Metadata extraction
    JSON_EXTRACT(metadata, '$.category') as category,
    JSON_EXTRACT(metadata, '$.tags') as tags,
    JSON_EXTRACT(metadata, '$.uploadedBy') as uploaded_by,
    JSON_EXTRACT(metadata, '$.permissions') as permissions,
    JSON_EXTRACT(metadata, '$.contentAnalysis.description') as content_description,
    JSON_EXTRACT(metadata, '$.contentAnalysis.keywords') as content_keywords,
    JSON_EXTRACT(metadata, '$.processingResults.thumbnailAvailable') as has_thumbnail,
    JSON_EXTRACT(metadata, '$.processingResults.textExtracted') as has_text_content,

    -- File access patterns
    download_count,
    last_accessed,
    access_frequency_score,

    -- Storage tier information
    storage_tier,
    CASE storage_tier
      WHEN 'hot' THEN 1
      WHEN 'warm' THEN 2  
      WHEN 'cold' THEN 3
      WHEN 'archive' THEN 4
      ELSE 5
    END as tier_priority,

    -- File age and usage
    EXTRACT(DAY FROM (CURRENT_TIMESTAMP - upload_timestamp)) as file_age_days,
    EXTRACT(DAY FROM (CURRENT_TIMESTAMP - last_accessed)) as days_since_last_access

  FROM GRIDFS_FILES('media_files')
  WHERE 
    -- Content type filters
    (content_type IN ('image/jpeg', 'image/png', 'application/pdf', 'video/mp4') OR content_type LIKE '%/%')

    -- Size filters
    AND file_size_bytes BETWEEN 1024 AND 1073741824  -- 1KB to 1GB

    -- Date range filters
    AND upload_timestamp >= CURRENT_TIMESTAMP - INTERVAL '90 days'

    -- Category and tag filters
    AND (JSON_EXTRACT(metadata, '$.category') IS NOT NULL)
    AND (JSON_EXTRACT(metadata, '$.tags') IS NOT NULL)
),

file_analytics AS (
  SELECT 
    fsr.*,

    -- Content analysis scoring
    CASE 
      WHEN fsr.content_description IS NOT NULL AND fsr.content_keywords IS NOT NULL THEN 'fully_analyzed'
      WHEN fsr.content_description IS NOT NULL OR fsr.content_keywords IS NOT NULL THEN 'partially_analyzed'
      ELSE 'not_analyzed'
    END as analysis_completeness,

    -- Access pattern classification
    CASE 
      WHEN fsr.access_frequency_score > 0.8 THEN 'frequently_accessed'
      WHEN fsr.access_frequency_score > 0.4 THEN 'moderately_accessed'
      WHEN fsr.access_frequency_score > 0.1 THEN 'rarely_accessed'
      ELSE 'never_accessed'
    END as access_pattern,

    -- Storage optimization opportunities
    CASE 
      WHEN fsr.days_since_last_access > 90 AND fsr.storage_tier IN ('hot', 'warm') THEN 'candidate_for_cold_storage'
      WHEN fsr.days_since_last_access > 365 AND fsr.storage_tier != 'archive' THEN 'candidate_for_archive'
      WHEN fsr.access_frequency_score > 0.6 AND fsr.storage_tier IN ('cold', 'archive') THEN 'candidate_for_hot_storage'
      ELSE 'appropriate_storage_tier'
    END as storage_optimization,

    -- File health assessment
    CASE 
      WHEN fsr.has_thumbnail = false AND fsr.content_type LIKE 'image/%' THEN 'missing_thumbnail'
      WHEN fsr.has_text_content = false AND fsr.content_type = 'application/pdf' THEN 'text_extraction_needed'
      WHEN fsr.content_description IS NULL AND fsr.content_keywords IS NULL AND fsr.file_age_days > 7 THEN 'analysis_overdue'
      ELSE 'healthy'
    END as file_health_status

  FROM file_search_results fsr
),

usage_patterns AS (
  SELECT 
    content_type,
    category,
    access_pattern,
    storage_tier,
    COUNT(*) as file_count,
    SUM(file_size_bytes) as total_bytes,
    AVG(download_count) as avg_downloads,
    AVG(access_frequency_score) as avg_access_score,

    -- Storage tier distribution
    COUNT(*) FILTER (WHERE storage_tier = 'hot') as hot_tier_count,
    COUNT(*) FILTER (WHERE storage_tier = 'warm') as warm_tier_count,
    COUNT(*) FILTER (WHERE storage_tier = 'cold') as cold_tier_count,
    COUNT(*) FILTER (WHERE storage_tier = 'archive') as archive_tier_count,

    -- Health metrics
    COUNT(*) FILTER (WHERE file_health_status = 'healthy') as healthy_files,
    COUNT(*) FILTER (WHERE file_health_status != 'healthy') as unhealthy_files,

    -- Optimization opportunities
    COUNT(*) FILTER (WHERE storage_optimization LIKE 'candidate_for_%') as optimization_candidates

  FROM file_analytics
  GROUP BY content_type, category, access_pattern, storage_tier
)

SELECT 
  fa.file_id,
  fa.filename,
  fa.content_type,
  ROUND(fa.file_size_bytes / 1024.0 / 1024.0, 2) as file_size_mb,
  fa.category,
  fa.tags,
  fa.uploaded_by,

  -- Access and usage information
  fa.download_count,
  fa.access_pattern,
  fa.days_since_last_access,
  ROUND(fa.access_frequency_score, 3) as access_score,

  -- Storage and optimization
  fa.storage_tier,
  fa.storage_optimization,
  fa.file_health_status,

  -- Content analysis
  fa.analysis_completeness,
  CASE WHEN fa.has_thumbnail THEN 'yes' ELSE 'no' END as thumbnail_available,
  CASE WHEN fa.has_text_content THEN 'yes' ELSE 'no' END as text_content_available,

  -- File management recommendations
  ARRAY[
    CASE WHEN fa.storage_optimization LIKE 'candidate_for_%' THEN 
           'Move to ' || REPLACE(REPLACE(fa.storage_optimization, 'candidate_for_', ''), '_storage', ' storage')
         END,
    CASE WHEN fa.file_health_status = 'missing_thumbnail' THEN 'Generate thumbnail' END,
    CASE WHEN fa.file_health_status = 'text_extraction_needed' THEN 'Extract text content' END,
    CASE WHEN fa.file_health_status = 'analysis_overdue' THEN 'Run content analysis' END,
    CASE WHEN fa.days_since_last_access > 180 AND fa.download_count = 0 THEN 'Consider deletion' END
  ]::TEXT[] as recommendations,

  -- Priority scoring for operations
  CASE 
    WHEN fa.file_health_status != 'healthy' THEN 'high'
    WHEN fa.storage_optimization LIKE 'candidate_for_%' THEN 'medium' 
    WHEN fa.analysis_completeness = 'not_analyzed' THEN 'medium'
    ELSE 'low'
  END as maintenance_priority,

  -- Search relevance scoring
  (
    CASE WHEN fa.access_frequency_score > 0.5 THEN 2 ELSE 0 END +
    CASE WHEN fa.analysis_completeness = 'fully_analyzed' THEN 1 ELSE 0 END +
    CASE WHEN fa.file_health_status = 'healthy' THEN 1 ELSE 0 END +
    CASE WHEN fa.storage_tier = 'hot' THEN 1 ELSE 0 END
  ) as relevance_score

FROM file_analytics fa
WHERE 
  -- Apply additional search filters
  fa.file_size_bytes BETWEEN 1048576 AND 104857600  -- 1MB to 100MB
  AND fa.file_age_days <= 60  -- Files from last 60 days
  AND fa.analysis_completeness != 'not_analyzed'  -- Only analyzed files

ORDER BY 
  -- Primary sort by maintenance priority, then relevance
  CASE fa.maintenance_priority 
    WHEN 'high' THEN 1 
    WHEN 'medium' THEN 2 
    ELSE 3 
  END,
  relevance_score DESC,
  fa.access_frequency_score DESC,
  fa.upload_timestamp DESC
LIMIT 100;

-- GridFS storage tier management and optimization
CREATE VIEW gridfs_storage_optimization AS
WITH current_storage_state AS (
  SELECT 
    storage_tier,
    COUNT(*) as file_count,
    SUM(file_size_bytes) as total_bytes,
    AVG(file_size_bytes) as avg_file_size,
    AVG(access_frequency_score) as avg_access_frequency,
    AVG(EXTRACT(DAY FROM (CURRENT_TIMESTAMP - last_accessed))) as avg_days_since_access,

    -- Cost analysis (simplified model)
    SUM(file_size_bytes) * CASE storage_tier
      WHEN 'hot' THEN 0.023    -- $0.023 per GB/month
      WHEN 'warm' THEN 0.0125  -- $0.0125 per GB/month  
      WHEN 'cold' THEN 0.004   -- $0.004 per GB/month
      WHEN 'archive' THEN 0.001 -- $0.001 per GB/month
      ELSE 0.023
    END / 1024.0 / 1024.0 / 1024.0 as estimated_monthly_cost_usd,

    -- Performance characteristics  
    AVG(CASE storage_tier
      WHEN 'hot' THEN 10      -- 10ms avg access time
      WHEN 'warm' THEN 100    -- 100ms avg access time
      WHEN 'cold' THEN 1000   -- 1s avg access time  
      WHEN 'archive' THEN 15000 -- 15s avg access time
      ELSE 1000
    END) as avg_access_time_ms

  FROM GRIDFS_FILES('media_files')
  WHERE upload_timestamp >= CURRENT_TIMESTAMP - INTERVAL '365 days'
  GROUP BY storage_tier
),

optimization_opportunities AS (
  SELECT 
    file_id,
    filename,
    storage_tier,
    file_size_bytes,
    access_frequency_score,
    EXTRACT(DAY FROM (CURRENT_TIMESTAMP - last_accessed)) as days_since_access,
    download_count,

    -- Current cost
    file_size_bytes * CASE storage_tier
      WHEN 'hot' THEN 0.023
      WHEN 'warm' THEN 0.0125  
      WHEN 'cold' THEN 0.004
      WHEN 'archive' THEN 0.001
      ELSE 0.023
    END / 1024.0 / 1024.0 / 1024.0 as current_monthly_cost_usd,

    -- Recommended tier based on access patterns
    CASE 
      WHEN access_frequency_score > 0.7 OR days_since_access <= 7 THEN 'hot'
      WHEN access_frequency_score > 0.3 OR days_since_access <= 30 THEN 'warm'
      WHEN access_frequency_score > 0.1 OR days_since_access <= 90 THEN 'cold'
      ELSE 'archive'
    END as recommended_tier,

    -- Potential savings calculation
    CASE 
      WHEN access_frequency_score > 0.7 OR days_since_access <= 7 THEN 0.023
      WHEN access_frequency_score > 0.3 OR days_since_access <= 30 THEN 0.0125
      WHEN access_frequency_score > 0.1 OR days_since_access <= 90 THEN 0.004
      ELSE 0.001
    END as recommended_cost_per_gb

  FROM GRIDFS_FILES('media_files')
  WHERE upload_timestamp >= CURRENT_TIMESTAMP - INTERVAL '365 days'
)

SELECT 
  css.storage_tier as current_tier,
  css.file_count,
  ROUND(css.total_bytes / 1024.0 / 1024.0 / 1024.0, 2) as storage_gb,
  ROUND(css.avg_file_size / 1024.0 / 1024.0, 2) as avg_file_size_mb,
  ROUND(css.avg_access_frequency, 3) as avg_access_frequency,
  ROUND(css.avg_days_since_access, 1) as avg_days_since_access,
  ROUND(css.estimated_monthly_cost_usd, 2) as current_monthly_cost_usd,
  ROUND(css.avg_access_time_ms, 0) as avg_access_time_ms,

  -- Optimization analysis
  (SELECT COUNT(*) 
   FROM optimization_opportunities oo 
   WHERE oo.storage_tier = css.storage_tier 
   AND oo.recommended_tier != oo.storage_tier) as files_needing_optimization,

  (SELECT SUM(ABS(oo.current_monthly_cost_usd - 
                   (oo.file_size_bytes * oo.recommended_cost_per_gb / 1024.0 / 1024.0 / 1024.0)))
   FROM optimization_opportunities oo 
   WHERE oo.storage_tier = css.storage_tier 
   AND oo.recommended_tier != oo.storage_tier) as potential_monthly_savings_usd,

  -- Tier health assessment
  CASE 
    WHEN css.avg_access_frequency < 0.1 AND css.storage_tier = 'hot' THEN 'overprovisioned'
    WHEN css.avg_access_frequency > 0.6 AND css.storage_tier IN ('cold', 'archive') THEN 'underprovisioned' 
    WHEN css.avg_days_since_access > 90 AND css.storage_tier IN ('hot', 'warm') THEN 'tier_too_hot'
    WHEN css.avg_days_since_access < 30 AND css.storage_tier IN ('cold', 'archive') THEN 'tier_too_cold'
    ELSE 'appropriately_tiered'
  END as tier_health_status,

  -- Recommendations
  CASE 
    WHEN css.avg_access_frequency < 0.1 AND css.storage_tier = 'hot' THEN 'Move files to cold or archive storage'
    WHEN css.avg_access_frequency > 0.6 AND css.storage_tier IN ('cold', 'archive') THEN 'Move files to hot storage'
    WHEN css.avg_days_since_access > 180 AND css.storage_tier != 'archive' THEN 'Consider archiving old files'
    ELSE 'Current tiering appears appropriate'
  END as optimization_recommendation

FROM current_storage_state css
ORDER BY css.estimated_monthly_cost_usd DESC;

-- QueryLeaf provides comprehensive GridFS capabilities:
-- 1. SQL-familiar syntax for MongoDB GridFS bucket configuration and management
-- 2. Advanced file upload and download operations with progress tracking
-- 3. Comprehensive metadata management and content analysis integration
-- 4. Intelligent storage tier management with cost optimization
-- 5. File search and retrieval with advanced filtering and relevance scoring
-- 6. Performance monitoring and optimization recommendations
-- 7. Enterprise security and compliance features built-in
-- 8. Automated file processing pipelines with thumbnail generation
-- 9. Storage efficiency analysis with deduplication and compression
-- 10. Production-ready file management with scalable architecture

Best Practices for Production GridFS Deployment

File Storage Architecture Design Principles

Essential principles for effective MongoDB GridFS production deployment:

  1. Bucket Design Strategy: Organize files into logical buckets based on content type, access patterns, and retention requirements
  2. Chunk Size Optimization: Configure appropriate chunk sizes based on file types and access patterns for optimal performance (bucket and chunk-size choices are illustrated in the sketch after this list)
  3. Metadata Management: Design comprehensive metadata schemas for efficient searching, categorization, and content management
  4. Storage Tier Strategy: Implement intelligent storage tiering based on file access frequency and business requirements
  5. Security Integration: Establish comprehensive access controls, encryption, and audit logging for enterprise security
  6. Performance Monitoring: Monitor upload/download performance, storage efficiency, and system resource utilization
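
As a concrete illustration of points 1 and 2, the sketch below uses the official Node.js driver's GridFSBucket to create separate buckets per content class, each with a chunk size suited to its typical file size and access pattern. The bucket names, database name, and chunk sizes are illustrative choices rather than prescriptions.

// Hedged sketch: per-content-class buckets with tailored chunk sizes
const { MongoClient, GridFSBucket } = require('mongodb');

async function createTieredBuckets(uri) {
  const client = new MongoClient(uri);
  await client.connect();
  const db = client.db('media'); // database name is illustrative

  // Small, frequently accessed assets: keep the 255 KB default chunk size
  const thumbnails = new GridFSBucket(db, { bucketName: 'thumbnails', chunkSizeBytes: 255 * 1024 });

  // Large, sequentially streamed video: larger chunks reduce per-chunk overhead
  const video = new GridFSBucket(db, { bucketName: 'video_masters', chunkSizeBytes: 4 * 1024 * 1024 });

  // Mixed-size documents with more random access: a middle-ground chunk size
  const documents = new GridFSBucket(db, { bucketName: 'documents', chunkSizeBytes: 1024 * 1024 });

  return { client, thumbnails, video, documents };
}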

Enterprise File Management

Design GridFS systems for enterprise-scale file operations:

  1. Content Processing Pipeline: Implement automated file processing for thumbnails, content analysis, and format optimization
  2. Disaster Recovery: Design backup strategies and cross-region replication for business continuity
  3. Compliance Management: Ensure file operations meet regulatory requirements and data retention policies
  4. API Integration: Build RESTful APIs and SDK integrations for seamless application development
  5. Monitoring and Alerting: Implement comprehensive monitoring for storage usage, performance, and operational health (a minimal polling sketch follows this list)
  6. Capacity Planning: Monitor growth patterns and plan storage capacity and performance requirements
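
To make the monitoring point concrete, here is a minimal polling sketch built on the getStorageStatistics() method of the file manager shown earlier. The capacity threshold, polling interval, and console-based alerting are illustrative assumptions; a production deployment would normally forward these metrics to an external monitoring and alerting system.

// Minimal storage monitoring sketch (threshold and interval are illustrative)
function startStorageMonitor(fileManager, { maxBytes = 500 * 1024 * 1024 * 1024, intervalMs = 60000 } = {}) {
  const timer = setInterval(async () => {
    const stats = await fileManager.getStorageStatistics();
    if (!stats.success) {
      console.error('Storage statistics unavailable:', stats.error);
      return;
    }

    const { totalFiles, totalBytes } = stats.systemStatistics;
    console.log(`GridFS usage: ${totalFiles} files, ${(totalBytes / 1024 / 1024 / 1024).toFixed(2)} GB`);

    if (totalBytes > maxBytes) {
      // Replace with a real alerting integration (pager, chat webhook, etc.)
      console.warn(`ALERT: GridFS usage at ${((totalBytes / maxBytes) * 100).toFixed(1)}% of configured capacity`);
    }
  }, intervalMs);

  return () => clearInterval(timer);
}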

Conclusion

MongoDB GridFS provides comprehensive large file storage capabilities that enable sophisticated binary data management, efficient streaming operations, and integrated metadata handling through distributed chunk-based storage, automatic replication, and transactional consistency. The native file management tools and streaming interfaces ensure that applications can handle large files efficiently with minimal infrastructure complexity.

Key MongoDB GridFS benefits include:

  • Efficient Binary Storage: Advanced chunk-based storage with compression, deduplication, and intelligent space optimization
  • Integrated Metadata Management: Comprehensive metadata handling with full-text search, tagging, and content analysis capabilities
  • Streaming Operations: High-performance upload and download streaming with progress tracking and parallel processing
  • Distributed Architecture: Built-in replication and distributed storage through MongoDB's replica set technology
  • Transaction Integration: Full transactional consistency between file operations and database operations within MongoDB
  • SQL Accessibility: Familiar SQL-style file management operations through QueryLeaf for accessible binary data operations

Whether you're building document management systems, media streaming platforms, enterprise content repositories, or distributed file storage solutions, MongoDB GridFS with QueryLeaf's familiar SQL interface provides the foundation for sophisticated, scalable file operations.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB GridFS operations while providing SQL-familiar syntax for file storage, retrieval, and management. Advanced file processing, content analysis, and storage optimization are seamlessly handled through familiar SQL constructs, making sophisticated binary data management accessible to SQL-oriented development teams.

The combination of MongoDB GridFS's robust file storage capabilities with SQL-style file operations makes it an ideal platform for applications requiring both large file handling and familiar database management patterns, ensuring your file storage infrastructure can scale efficiently while maintaining operational simplicity and developer productivity.

MongoDB Index Optimization and Query Performance Analysis: Advanced Database Performance Tuning and Query Optimization for High-Performance Applications

High-performance database applications require sophisticated indexing strategies and comprehensive query optimization techniques that can handle complex query patterns, large data volumes, and evolving access requirements while maintaining optimal response times. Traditional database optimization approaches often struggle with dynamic workloads, compound query patterns, and the complexity of managing multiple index strategies across diverse data access patterns, leading to suboptimal performance, excessive resource consumption, and operational challenges in production environments.

MongoDB provides comprehensive index optimization capabilities through advanced indexing strategies, sophisticated query analysis tools, and intelligent performance monitoring features that enable database administrators and developers to achieve optimal query performance with minimal resource overhead. Unlike traditional databases that require complex index tuning procedures and manual optimization workflows, MongoDB integrates performance analysis directly into the database with automated index recommendations, real-time query analysis, and built-in optimization guidance.
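
As a brief illustration of the built-in analysis tools referenced above, the sketch below uses two facilities that ship with MongoDB and its Node.js driver: explain() for per-query execution statistics and the $indexStats aggregation stage for per-index usage counters. The connection string, database, collection, and query shape are illustrative.

// Hedged sketch: inspecting query execution and index usage with built-in tools
const { MongoClient } = require('mongodb');

async function inspectQueryAndIndexes() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const orders = client.db('app').collection('orders'); // names are illustrative

  // Execution statistics: documents examined vs. returned, winning plan, execution time
  const plan = await orders
    .find({ status: 'pending', customerId: 12345 })
    .explain('executionStats');
  console.log(plan.executionStats.totalDocsExamined, plan.executionStats.nReturned);

  // Per-index usage counters maintained by the server since last restart
  const indexStats = await orders.aggregate([{ $indexStats: {} }]).toArray();
  console.log(indexStats.map(s => ({ name: s.name, ops: s.accesses.ops })));

  await client.close();
}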

The Traditional Query Performance Challenge

Conventional approaches to database query optimization in relational systems face significant limitations in performance analysis and index management:

-- Traditional PostgreSQL query optimization - manual index management with limited analysis capabilities

-- Basic index tracking table with minimal functionality
CREATE TABLE index_usage_stats (
    index_id SERIAL PRIMARY KEY,
    schema_name VARCHAR(100) NOT NULL,
    table_name VARCHAR(100) NOT NULL,
    index_name VARCHAR(100) NOT NULL,
    index_type VARCHAR(50),

    -- Basic usage statistics (very limited visibility)
    index_scans BIGINT DEFAULT 0,
    tuples_read BIGINT DEFAULT 0,
    tuples_fetched BIGINT DEFAULT 0,

    -- Size information (manual tracking)
    index_size_bytes BIGINT,
    table_size_bytes BIGINT,

    -- Basic metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_analyzed TIMESTAMP,
    is_unique BOOLEAN DEFAULT false,
    is_partial BOOLEAN DEFAULT false,

    -- Simple effectiveness metrics
    scan_ratio DECIMAL(10,4),
    selectivity_estimate DECIMAL(10,4)
);

-- Query performance tracking table (basic functionality)
CREATE TABLE query_performance_log (
    query_id SERIAL PRIMARY KEY,
    query_hash VARCHAR(64),
    query_text TEXT,

    -- Basic execution metrics
    execution_time_ms INTEGER,
    rows_examined BIGINT,
    rows_returned BIGINT,
    execution_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Resource usage (limited tracking)
    cpu_usage_ms INTEGER,
    memory_usage_kb INTEGER,
    disk_reads INTEGER,

    -- Connection information
    database_name VARCHAR(100),
    username VARCHAR(100),
    application_name VARCHAR(100),

    -- Basic query plan information (very limited)
    query_plan_hash VARCHAR(64),
    index_usage TEXT[], -- Simple array of index names

    -- Performance classification
    performance_category VARCHAR(50) DEFAULT 'unknown'
);

-- Manual query plan analysis function (very basic capabilities)
CREATE OR REPLACE FUNCTION analyze_query_performance(
    query_text_param TEXT,
    execution_count INTEGER DEFAULT 1
) RETURNS TABLE (
    avg_execution_time_ms INTEGER,
    total_rows_examined BIGINT,
    total_rows_returned BIGINT,
    selectivity_ratio DECIMAL(10,4),
    suggested_indexes TEXT[],
    performance_rating VARCHAR(20)
) AS $$
DECLARE
    total_execution_time INTEGER := 0;
    total_examined BIGINT := 0;
    total_returned BIGINT := 0;
    execution_counter INTEGER := 0;
    current_execution_time INTEGER;
    current_examined BIGINT;
    current_returned BIGINT;
    plan_info TEXT;
BEGIN
    -- Simulate multiple query executions for analysis
    WHILE execution_counter < execution_count LOOP
        -- Execute EXPLAIN ANALYZE (simplified simulation)
        BEGIN
            -- This would be an actual EXPLAIN ANALYZE in reality
            EXECUTE 'EXPLAIN ANALYZE ' || query_text_param INTO plan_info;

            -- Extract basic metrics (very simplified parsing)
            current_execution_time := (random() * 1000 + 10)::INTEGER; -- Simulated execution time
            current_examined := (random() * 10000 + 100)::BIGINT; -- Simulated rows examined
            current_returned := (random() * 1000 + 10)::BIGINT; -- Simulated rows returned

            total_execution_time := total_execution_time + current_execution_time;
            total_examined := total_examined + current_examined;
            total_returned := total_returned + current_returned;

            -- Log query performance
            INSERT INTO query_performance_log (
                query_text,
                execution_time_ms,
                rows_examined,
                rows_returned,
                query_plan_hash
            ) VALUES (
                query_text_param,
                current_execution_time,
                current_examined,
                current_returned,
                md5(plan_info)
            );

        EXCEPTION WHEN OTHERS THEN
            -- Basic error handling
            current_execution_time := 9999; -- Error indicator
            current_examined := 0;
            current_returned := 0;
        END;

        execution_counter := execution_counter + 1;
    END LOOP;

    -- Calculate average metrics
    RETURN QUERY SELECT 
        (total_execution_time / execution_count)::INTEGER,
        total_examined,
        total_returned,
        CASE 
            WHEN total_examined > 0 THEN (total_returned::DECIMAL / total_examined)
            ELSE 0
        END,

        -- Very basic index suggestions (limited analysis)
        CASE 
            WHEN total_execution_time > 1000 THEN ARRAY['Consider adding indexes on WHERE clause columns']
            WHEN total_examined > total_returned * 10 THEN ARRAY['Add indexes to improve selectivity']
            ELSE ARRAY['Performance appears acceptable']
        END::TEXT[],

        -- Simple performance rating
        CASE 
            WHEN total_execution_time < 100 THEN 'excellent'
            WHEN total_execution_time < 500 THEN 'good'
            WHEN total_execution_time < 1000 THEN 'acceptable'
            ELSE 'poor'
        END;

END;
$$ LANGUAGE plpgsql;

-- Execute query performance analysis (basic functionality)
SELECT * FROM analyze_query_performance('SELECT * FROM users WHERE email = ''test@example.com'' AND created_at > ''2023-01-01''', 5);

-- Index effectiveness monitoring (limited capabilities)
WITH index_effectiveness AS (
    SELECT 
        ius.schema_name,
        ius.table_name,
        ius.index_name,
        ius.index_type,
        ius.index_scans,
        ius.tuples_read,
        ius.tuples_fetched,
        ius.index_size_bytes,

        -- Basic effectiveness calculations
        CASE 
            WHEN ius.index_scans > 0 AND ius.tuples_read > 0 THEN
                ius.tuples_fetched::DECIMAL / ius.tuples_read
            ELSE 0
        END as fetch_ratio,

        CASE 
            WHEN ius.table_size_bytes > 0 AND ius.index_size_bytes > 0 THEN
                (ius.index_size_bytes::DECIMAL / ius.table_size_bytes) * 100
            ELSE 0
        END as size_overhead_percent,

        -- Usage frequency analysis
        CASE 
            WHEN ius.index_scans = 0 THEN 'unused'
            WHEN ius.index_scans < 10 THEN 'rarely_used'
            WHEN ius.index_scans < 100 THEN 'moderately_used'
            ELSE 'frequently_used'
        END as usage_category

    FROM index_usage_stats ius
    WHERE ius.last_analyzed >= CURRENT_DATE - INTERVAL '7 days'
),

query_patterns AS (
    SELECT 
        qpl.database_name,
        qpl.query_hash,
        COUNT(*) as execution_count,
        AVG(qpl.execution_time_ms) as avg_execution_time,
        MAX(qpl.execution_time_ms) as max_execution_time,
        AVG(qpl.rows_examined) as avg_rows_examined,
        AVG(qpl.rows_returned) as avg_rows_returned,

        -- Performance trend analysis (very basic)
        CASE 
            WHEN COUNT(*) > 100 AND AVG(qpl.execution_time_ms) > 500 THEN 'high_impact_slow'
            WHEN COUNT(*) > 1000 THEN 'high_frequency'
            WHEN AVG(qpl.execution_time_ms) > 1000 THEN 'slow_query'
            ELSE 'normal'
        END as query_pattern_type,

        -- Index usage analysis from query logs
        -- Flatten each row's index_usage array before aggregating (unnest cannot be used inside an aggregate)
        STRING_AGG(DISTINCT array_to_string(qpl.index_usage, ', '), '; ') as indexes_used,

        -- Execution time trends
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY qpl.execution_time_ms) as p95_execution_time

    FROM query_performance_log qpl
    WHERE qpl.execution_timestamp >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY qpl.database_name, qpl.query_hash
)

SELECT 
    ie.schema_name,
    ie.table_name,
    ie.index_name,
    ie.index_type,
    ie.usage_category,

    -- Index effectiveness metrics
    ie.index_scans,
    ROUND(ie.fetch_ratio, 4) as selectivity_ratio,
    ROUND(ie.size_overhead_percent, 2) as size_overhead_percent,

    -- Size analysis
    ROUND(ie.index_size_bytes / 1024.0 / 1024.0, 2) as index_size_mb,

    -- Related query patterns
    COUNT(qp.query_hash) as related_query_patterns,
    COALESCE(AVG(qp.avg_execution_time), 0) as avg_query_time_using_index,
    COALESCE(AVG(qp.avg_rows_examined), 0) as avg_rows_examined,

    -- Index recommendations (very basic logic)
    CASE 
        WHEN ie.usage_category = 'unused' AND ie.index_size_bytes > 100*1024*1024 THEN 'consider_dropping'
        WHEN ie.fetch_ratio < 0.1 AND ie.index_scans > 0 THEN 'poor_selectivity'
        WHEN ie.usage_category = 'frequently_used' AND ie.fetch_ratio > 0.8 THEN 'high_performance'
        WHEN ie.size_overhead_percent > 50 THEN 'review_necessity'
        ELSE 'monitor'
    END as recommendation,

    -- Performance impact assessment
    CASE 
        WHEN ie.usage_category IN ('frequently_used', 'moderately_used') AND ie.fetch_ratio > 0.5 THEN 'positive_impact'
        WHEN ie.usage_category = 'unused' THEN 'no_impact'
        WHEN ie.fetch_ratio < 0.1 THEN 'negative_impact'
        ELSE 'unclear_impact'
    END as performance_impact

FROM index_effectiveness ie
LEFT JOIN query_patterns qp ON qp.indexes_used LIKE '%' || ie.index_name || '%'
GROUP BY 
    ie.schema_name, ie.table_name, ie.index_name, ie.index_type, 
    ie.usage_category, ie.index_scans, ie.fetch_ratio, 
    ie.size_overhead_percent, ie.index_size_bytes
ORDER BY 
    ie.index_scans DESC, 
    ie.fetch_ratio DESC,
    ie.index_size_bytes DESC;

-- Query optimization recommendations (very limited analysis)
WITH slow_queries AS (
    SELECT 
        query_hash,
        query_text,
        COUNT(*) as execution_count,
        AVG(execution_time_ms) as avg_time,
        MAX(execution_time_ms) as max_time,
        AVG(rows_examined) as avg_examined,
        AVG(rows_returned) as avg_returned,

        -- Basic pattern detection
        CASE 
            WHEN query_text ILIKE '%WHERE%=%' THEN 'equality_filter'
            WHEN query_text ILIKE '%WHERE%>%' OR query_text ILIKE '%WHERE%<%' THEN 'range_filter'
            WHEN query_text ILIKE '%ORDER BY%' THEN 'sorting'
            WHEN query_text ILIKE '%GROUP BY%' THEN 'aggregation'
            ELSE 'unknown_pattern'
        END as query_pattern

    FROM query_performance_log
    WHERE execution_time_ms > 500  -- Focus on slow queries
    AND execution_timestamp >= CURRENT_DATE - INTERVAL '24 hours'
    GROUP BY query_hash, query_text
    HAVING COUNT(*) >= 5  -- Frequently executed slow queries
)

SELECT 
    sq.query_hash,
    LEFT(sq.query_text, 100) || '...' as query_preview,
    sq.execution_count,
    ROUND(sq.avg_time, 0) as avg_execution_ms,
    sq.max_time as max_execution_ms,
    ROUND(sq.avg_examined, 0) as avg_rows_examined,
    ROUND(sq.avg_returned, 0) as avg_rows_returned,
    sq.query_pattern,

    -- Selectivity analysis
    CASE 
        WHEN sq.avg_examined > 0 THEN 
            ROUND((sq.avg_returned / sq.avg_examined) * 100, 2)
        ELSE 0
    END as selectivity_percent,

    -- Impact assessment
    ROUND(sq.execution_count * sq.avg_time, 0) as total_time_impact_ms,

    -- Basic optimization suggestions (very limited)
    CASE 
        WHEN sq.query_pattern = 'equality_filter' AND sq.avg_examined > sq.avg_returned * 10 THEN 
            'Add single-column index on equality filter columns'
        WHEN sq.query_pattern = 'range_filter' AND sq.avg_time > 1000 THEN 
            'Consider range-optimized index or query rewrite'
        WHEN sq.query_pattern = 'sorting' AND sq.avg_time > 800 THEN 
            'Add index supporting ORDER BY clause'
        WHEN sq.query_pattern = 'aggregation' AND sq.avg_examined > 10000 THEN 
            'Consider partial index or pre-aggregated data'
        WHEN sq.avg_examined > sq.avg_returned * 100 THEN 
            'Review query selectivity and indexing strategy'
        ELSE 'Manual analysis required'
    END as optimization_suggestion,

    -- Priority assessment
    CASE 
        WHEN sq.execution_count > 100 AND sq.avg_time > 1000 THEN 'high'
        WHEN sq.execution_count > 50 OR sq.avg_time > 2000 THEN 'medium'
        ELSE 'low'
    END as optimization_priority

FROM slow_queries sq
ORDER BY 
    CASE 
        WHEN sq.execution_count > 100 AND sq.avg_time > 1000 THEN 1
        WHEN sq.execution_count > 50 OR sq.avg_time > 2000 THEN 2
        ELSE 3
    END,
    (sq.execution_count * sq.avg_time) DESC;

-- Problems with traditional query optimization approaches:
-- 1. Manual index management with no automated recommendations
-- 2. Limited query plan analysis and optimization guidance
-- 3. Basic performance metrics with no comprehensive analysis
-- 4. No real-time query performance monitoring
-- 5. Minimal index effectiveness assessment
-- 6. Complex manual tuning procedures requiring deep database expertise
-- 7. No support for compound index optimization strategies
-- 8. Limited visibility into query execution patterns and resource usage
-- 9. Basic alerting with no proactive optimization suggestions
-- 10. No integration with application performance monitoring systems

MongoDB provides comprehensive index optimization with advanced query performance analysis capabilities:

// MongoDB Advanced Index Optimization and Query Performance Analysis
const { MongoClient } = require('mongodb');
const { EventEmitter } = require('events');

// Comprehensive MongoDB Performance Optimizer
class AdvancedPerformanceOptimizer extends EventEmitter {
  constructor(mongoUri, optimizationConfig = {}) {
    super();
    this.mongoUri = mongoUri;
    this.client = null;
    this.db = null;

    // Advanced optimization configuration
    this.config = {
      // Performance analysis configuration
      enableQueryProfiling: optimizationConfig.enableQueryProfiling !== false,
      profilingSampleRate: optimizationConfig.profilingSampleRate || 0.1,
      slowQueryThresholdMs: optimizationConfig.slowQueryThresholdMs || 100,

      // Index optimization settings
      enableAutomaticIndexRecommendations: optimizationConfig.enableAutomaticIndexRecommendations !== false,
      enableIndexUsageAnalysis: optimizationConfig.enableIndexUsageAnalysis !== false,
      enableCompoundIndexOptimization: optimizationConfig.enableCompoundIndexOptimization || false,

      // Monitoring and alerting
      enablePerformanceMonitoring: optimizationConfig.enablePerformanceMonitoring !== false,
      enableRealTimeAnalysis: optimizationConfig.enableRealTimeAnalysis || false,
      enablePerformanceAlerting: optimizationConfig.enablePerformanceAlerting || false,

      // Analysis parameters
      analysisWindowHours: optimizationConfig.analysisWindowHours || 24,
      minQueryExecutions: optimizationConfig.minQueryExecutions || 10,
      indexUsageThreshold: optimizationConfig.indexUsageThreshold || 0.1,

      // Resource optimization
      enableResourceOptimization: optimizationConfig.enableResourceOptimization || false,
      enableQueryPlanCaching: optimizationConfig.enableQueryPlanCaching !== false,
      enableConnectionPoolOptimization: optimizationConfig.enableConnectionPoolOptimization || false
    };

    // Performance tracking and analysis state
    this.queryPatterns = new Map();
    this.indexUsageStats = new Map();
    this.performanceMetrics = new Map();
    this.optimizationRecommendations = [];

    // Query execution tracking
    this.queryExecutionHistory = [];
    this.slowQueryLog = [];
    this.indexEffectivenessCache = new Map();

    this.initializePerformanceOptimizer();
  }

  async initializePerformanceOptimizer() {
    console.log('Initializing advanced MongoDB performance optimizer...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.mongoUri, {
        // Optimized connection settings
        maxPoolSize: 20,
        minPoolSize: 5,
        maxIdleTimeMS: 30000,
        serverSelectionTimeoutMS: 5000,
        heartbeatFrequencyMS: 10000
      });

      await this.client.connect();
      this.db = this.client.db();

      // Setup performance monitoring infrastructure
      await this.setupPerformanceInfrastructure();

      // Enable query profiling if configured
      if (this.config.enableQueryProfiling) {
        await this.enableQueryProfiling();
      }

      // Start real-time monitoring if enabled
      if (this.config.enableRealTimeAnalysis) {
        await this.startRealTimeMonitoring();
      }

      // Initialize index analysis
      if (this.config.enableIndexUsageAnalysis) {
        await this.initializeIndexAnalysis();
      }

      console.log('Advanced performance optimizer initialized successfully');

    } catch (error) {
      console.error('Error initializing performance optimizer:', error);
      throw error;
    }
  }

  async setupPerformanceInfrastructure() {
    console.log('Setting up performance monitoring infrastructure...');

    try {
      // Create collections for performance tracking
      const collections = {
        queryPerformanceLog: this.db.collection('query_performance_log'),
        indexUsageStats: this.db.collection('index_usage_stats'),
        performanceMetrics: this.db.collection('performance_metrics'),
        optimizationRecommendations: this.db.collection('optimization_recommendations'),
        queryPatterns: this.db.collection('query_patterns')
      };

      // Create indexes for performance collections.
      // TTL indexes must be single-field, so retention is enforced with dedicated
      // TTL indexes on the timestamp field rather than on the compound query indexes.
      await collections.queryPerformanceLog.createIndex(
        { timestamp: -1, executionTimeMs: -1 },
        { background: true }
      );
      await collections.queryPerformanceLog.createIndex(
        { timestamp: 1 },
        { expireAfterSeconds: 7 * 24 * 60 * 60 } // 7 days retention
      );

      await collections.indexUsageStats.createIndex(
        { collection: 1, indexName: 1, timestamp: -1 },
        { background: true }
      );

      await collections.performanceMetrics.createIndex(
        { metricType: 1, timestamp: -1 },
        { background: true }
      );
      await collections.performanceMetrics.createIndex(
        { timestamp: 1 },
        { expireAfterSeconds: 30 * 24 * 60 * 60 } // 30 days retention
      );

      this.collections = collections;

    } catch (error) {
      console.error('Error setting up performance infrastructure:', error);
      throw error;
    }
  }

  async enableQueryProfiling() {
    console.log('Enabling MongoDB query profiling...');

    try {
      // Set profiling level based on configuration.
      // The profile command applies to the database it is run against, so issue it
      // on this.db rather than the admin database. Level 1 captures operations slower
      // than slowms, and sampleRate controls what fraction of those are recorded.
      await this.db.command({
        profile: 1,
        slowms: this.config.slowQueryThresholdMs,
        sampleRate: this.config.profilingSampleRate
      });

      console.log(`Query profiling enabled with ${this.config.slowQueryThresholdMs}ms threshold and ${this.config.profilingSampleRate} sample rate`);

    } catch (error) {
      console.error('Error enabling query profiling:', error);
      // Don't throw - profiling is optional
    }
  }

  async analyzeQueryPerformance(timeRangeHours = 24) {
    console.log(`Analyzing query performance for the last ${timeRangeHours} hours...`);

    try {
      const analysisStartTime = new Date(Date.now() - (timeRangeHours * 60 * 60 * 1000));

      // Analyze profiler data for slow queries and patterns
      const slowQueries = await this.analyzeSlowQueries(analysisStartTime);
      const queryPatterns = await this.analyzeQueryPatterns(analysisStartTime);
      const indexUsageAnalysis = await this.analyzeIndexUsage(analysisStartTime);

      // Generate performance insights
      const performanceInsights = {
        analysisTimestamp: new Date(),
        timeRangeHours: timeRangeHours,

        // Query performance summary
        queryPerformanceSummary: {
          totalQueries: slowQueries.totalQueries,
          slowQueries: slowQueries.slowQueryCount,
          averageExecutionTime: slowQueries.averageExecutionTime,
          p95ExecutionTime: slowQueries.p95ExecutionTime,
          p99ExecutionTime: slowQueries.p99ExecutionTime,

          // Query type distribution
          queryTypeDistribution: queryPatterns.queryTypeDistribution,

          // Resource usage patterns
          resourceUsage: {
            totalExaminedDocuments: slowQueries.totalExaminedDocuments,
            totalReturnedDocuments: slowQueries.totalReturnedDocuments,
            averageSelectivityRatio: slowQueries.averageSelectivityRatio
          }
        },

        // Index effectiveness analysis
        indexEffectiveness: {
          totalIndexes: indexUsageAnalysis.totalIndexes,
          activelyUsedIndexes: indexUsageAnalysis.activelyUsedIndexes,
          unusedIndexes: indexUsageAnalysis.unusedIndexes,
          inefficientIndexes: indexUsageAnalysis.inefficientIndexes,

          // Index usage patterns
          indexUsagePatterns: indexUsageAnalysis.usagePatterns,

          // Index performance metrics
          averageIndexSelectivity: indexUsageAnalysis.averageSelectivity,
          indexSizeOverhead: indexUsageAnalysis.totalIndexSizeBytes
        },

        // Performance bottlenecks
        performanceBottlenecks: await this.identifyPerformanceBottlenecks(slowQueries, queryPatterns, indexUsageAnalysis),

        // Optimization opportunities
        optimizationOpportunities: await this.generateOptimizationRecommendations(slowQueries, queryPatterns, indexUsageAnalysis)
      };

      // Store performance analysis results
      await this.collections.performanceMetrics.insertOne({
        metricType: 'comprehensive_analysis',
        timestamp: new Date(),
        analysisResults: performanceInsights
      });

      this.emit('performanceAnalysisCompleted', performanceInsights);

      return {
        success: true,
        analysisResults: performanceInsights
      };

    } catch (error) {
      console.error('Error analyzing query performance:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async analyzeSlowQueries(startTime) {
    console.log('Analyzing slow query patterns...');

    try {
      // Query the profiler collection for slow queries
      const profilerCollection = this.db.collection('system.profile');

      const slowQueryAggregation = [
        {
          $match: {
            ts: { $gte: startTime },
            op: { $in: ['query', 'getmore'] }, // Focus on read operations
            millis: { $gte: this.config.slowQueryThresholdMs }
          }
        },
        {
          $addFields: {
            // Normalize query shape for pattern analysis
            queryShape: {
              $function: {
                body: function(command) {
                  // Simplified query shape normalization
                  // Profiler entries store find commands as
                  // { find: "<collection>", filter: {...}, sort: {...}, projection: {...} }
                  if (!command || !command.find) return 'unknown';

                  const filter = command.filter || {};
                  const sort = command.sort || {};
                  const projection = command.projection || {};

                  // Create shape by replacing values with type indicators
                  const shapeFilter = Object.keys(filter).reduce((acc, key) => {
                    acc[key] = typeof filter[key];
                    return acc;
                  }, {});

                  return JSON.stringify({
                    filter: shapeFilter,
                    sort: Object.keys(sort),
                    projection: Object.keys(projection)
                  });
                },
                args: ['$command'],
                lang: 'js'
              }
            },

            // Extract collection name
            targetCollection: {
              $ifNull: ['$command.find', '$command.collection']
            },

            // Calculate selectivity ratio
            selectivityRatio: {
              $cond: [
                { $and: [{ $gt: ['$docsExamined', 0] }, { $gt: ['$nreturned', 0] }] },
                { $divide: ['$nreturned', '$docsExamined'] },
                0
              ]
            }
          }
        },
        {
          $group: {
            _id: {
              queryShape: '$queryShape',
              collection: '$targetCollection'
            },

            // Execution statistics
            executionCount: { $sum: 1 },
            totalExecutionTime: { $sum: '$millis' },
            averageExecutionTime: { $avg: '$millis' },
            maxExecutionTime: { $max: '$millis' },
            minExecutionTime: { $min: '$millis' },

            // Document examination statistics
            totalDocsExamined: { $sum: '$docsExamined' },
            totalDocsReturned: { $sum: '$nreturned' },
            averageSelectivity: { $avg: '$selectivityRatio' },

            // Index usage tracking
            indexesUsed: { $addToSet: '$planSummary' },

            // Resource usage
            totalKeysExamined: { $sum: '$keysExamined' },

            // Sample query for reference
            sampleQuery: { $first: '$command' },
            sampleTimestamp: { $first: '$ts' }
          }
        },
        {
          $addFields: {
            // Calculate performance impact
            performanceImpact: {
              $multiply: ['$executionCount', '$averageExecutionTime']
            },

            // Assess query efficiency
            queryEfficiency: {
              $cond: [
                { $gt: ['$averageSelectivity', 0.1] },
                'efficient',
                { $cond: [{ $gt: ['$averageSelectivity', 0.01] }, 'moderate', 'inefficient'] }
              ]
            }
          }
        },
        {
          $sort: { performanceImpact: -1 }
        },
        {
          $limit: 100 // Top 100 slow query patterns
        }
      ];

      const slowQueryResults = await profilerCollection.aggregate(slowQueryAggregation).toArray();

      // Calculate summary statistics
      const totalQueries = slowQueryResults.reduce((sum, query) => sum + query.executionCount, 0);
      const totalExecutionTime = slowQueryResults.reduce((sum, query) => sum + query.totalExecutionTime, 0);
      const allExecutionTimes = slowQueryResults.flatMap(query => Array(query.executionCount).fill(query.averageExecutionTime));

      // Calculate percentiles
      allExecutionTimes.sort((a, b) => a - b);
      const p95Index = Math.floor(allExecutionTimes.length * 0.95);
      const p99Index = Math.floor(allExecutionTimes.length * 0.99);

      return {
        slowQueryPatterns: slowQueryResults,
        totalQueries: totalQueries,
        slowQueryCount: slowQueryResults.length,
        averageExecutionTime: totalQueries > 0 ? totalExecutionTime / totalQueries : 0,
        p95ExecutionTime: allExecutionTimes[p95Index] || 0,
        p99ExecutionTime: allExecutionTimes[p99Index] || 0,
        totalExaminedDocuments: slowQueryResults.reduce((sum, query) => sum + query.totalDocsExamined, 0),
        totalReturnedDocuments: slowQueryResults.reduce((sum, query) => sum + query.totalDocsReturned, 0),
        averageSelectivityRatio: slowQueryResults.length > 0 
          ? slowQueryResults.reduce((sum, query) => sum + (query.averageSelectivity || 0), 0) / slowQueryResults.length 
          : 0
      };

    } catch (error) {
      console.error('Error analyzing slow queries:', error);
      throw error;
    }
  }

  async analyzeQueryPatterns(startTime) {
    console.log('Analyzing query execution patterns...');

    try {
      const profilerCollection = this.db.collection('system.profile');

      // Analyze query type distribution and patterns
      const queryPatternAggregation = [
        {
          $match: {
            ts: { $gte: startTime },
            op: { $in: ['query', 'getmore', 'update', 'delete', 'insert'] }
          }
        },
        {
          $addFields: {
            // Categorize query operations
            queryCategory: {
              $switch: {
                branches: [
                  {
                    case: { $eq: ['$op', 'query'] },
                    then: {
                      $cond: [
                        { $ifNull: ['$command.sort', false] },
                        'sorted_query',
                        { $cond: [
                          { $gt: [{ $size: { $objectToArray: { $ifNull: ['$command.filter', {}] } } }, 0] },
                          'filtered_query',
                          'full_scan'
                        ]}
                      ]
                    }
                  },
                  { case: { $eq: ['$op', 'update'] }, then: 'update_operation' },
                  { case: { $eq: ['$op', 'delete'] }, then: 'delete_operation' },
                  { case: { $eq: ['$op', 'insert'] }, then: 'insert_operation' }
                ],
                default: 'other_operation'
              }
            },

            // Analyze query complexity
            queryComplexity: {
              $switch: {
                branches: [
                  {
                    case: { $and: [
                      { $eq: ['$op', 'query'] },
                      { $gt: [{ $size: { $objectToArray: { $ifNull: ['$command.filter', {}] } } }, 5] }
                    ]},
                    then: 'complex'
                  },
                  {
                    case: { $and: [
                      { $eq: ['$op', 'query'] },
                      { $gt: [{ $size: { $objectToArray: { $ifNull: ['$command.filter', {}] } } }, 2] }
                    ]},
                    then: 'moderate'
                  }
                ],
                default: 'simple'
              }
            }
          }
        },
        {
          $group: {
            _id: {
              collection: { $ifNull: ['$command.find', '$command.collection', '$ns'] },
              queryCategory: '$queryCategory',
              queryComplexity: '$queryComplexity'
            },

            // Pattern statistics
            executionCount: { $sum: 1 },
            averageExecutionTime: { $avg: '$millis' },
            totalExecutionTime: { $sum: '$millis' },

            // Resource usage patterns
            averageDocsExamined: { $avg: '$docsExamined' },
            averageDocsReturned: { $avg: '$nreturned' },

            // Index usage patterns
            commonIndexes: { $addToSet: '$planSummary' },

            // Performance characteristics
            maxExecutionTime: { $max: '$millis' },
            minExecutionTime: { $min: '$millis' }
          }
        },
        {
          $sort: { totalExecutionTime: -1 }
        }
      ];

      const queryPatternResults = await profilerCollection.aggregate(queryPatternAggregation).toArray();

      // Calculate query type distribution
      const queryTypeDistribution = queryPatternResults.reduce((distribution, pattern) => {
        const category = pattern._id.queryCategory;
        if (!distribution[category]) {
          distribution[category] = {
            count: 0,
            totalTime: 0,
            avgTime: 0
          };
        }

        distribution[category].count += pattern.executionCount;
        distribution[category].totalTime += pattern.totalExecutionTime;
        distribution[category].avgTime = distribution[category].totalTime / distribution[category].count;

        return distribution;
      }, {});

      return {
        queryPatterns: queryPatternResults,
        queryTypeDistribution: queryTypeDistribution,
        totalPatterns: queryPatternResults.length
      };

    } catch (error) {
      console.error('Error analyzing query patterns:', error);
      throw error;
    }
  }

  async analyzeIndexUsage(startTime) {
    console.log('Analyzing index usage effectiveness...');

    try {
      // Get all collections for comprehensive index analysis
      const collections = await this.db.listCollections().toArray();
      const indexAnalysisResults = [];

      for (const collectionInfo of collections) {
        if (collectionInfo.type === 'collection') {
          const collection = this.db.collection(collectionInfo.name);

          // Get index definitions; indexes() does not report sizes, so read them from $collStats
          const indexes = await collection.indexes();
          const [collStats] = await collection.aggregate([
            { $collStats: { storageStats: {} } }
          ]).toArray();
          const indexSizes = (collStats && collStats.storageStats && collStats.storageStats.indexSizes) || {};

          // Analyze each index, attaching the measured size for effectiveness calculations
          for (const index of indexes) {
            index.size = indexSizes[index.name] || 0;
            try {
              // Get index usage statistics
              const indexStats = await collection.aggregate([
                { $indexStats: {} },
                { $match: { name: index.name } }
              ]).toArray();

              const indexStat = indexStats[0];

              if (indexStat) {
                // Calculate index effectiveness metrics
                const indexAnalysis = {
                  collection: collectionInfo.name,
                  indexName: index.name,
                  indexKeys: index.key,
                  indexType: this.determineIndexType(index),

                  // Usage statistics
                  usageCount: indexStat.accesses?.ops || 0,
                  lastUsed: indexStat.accesses?.since || null,

                  // Size and storage information
                  indexSize: index.size || 0,

                  // Effectiveness calculations
                  usageEffectiveness: this.calculateIndexEffectiveness(indexStat, index),

                  // Index health assessment
                  healthStatus: this.assessIndexHealth(indexStat, index),

                  // Optimization opportunities
                  optimizationOpportunities: await this.identifyIndexOptimizations(collection, index, indexStat)
                };

                indexAnalysisResults.push(indexAnalysis);
              }

            } catch (indexError) {
              console.warn(`Error analyzing index ${index.name} on ${collectionInfo.name}:`, indexError.message);
            }
          }
        }
      }

      // Calculate summary statistics
      const totalIndexes = indexAnalysisResults.length;
      const activelyUsedIndexes = indexAnalysisResults.filter(index => index.usageCount > 0).length;
      const unusedIndexes = indexAnalysisResults.filter(index => index.usageCount === 0);
      const inefficientIndexes = indexAnalysisResults.filter(index => 
        index.healthStatus === 'inefficient' || index.usageEffectiveness < 0.1
      );

      // Analyze usage patterns
      const usagePatterns = this.analyzeIndexUsagePatterns(indexAnalysisResults);

      return {
        indexAnalysisResults: indexAnalysisResults,
        totalIndexes: totalIndexes,
        activelyUsedIndexes: activelyUsedIndexes,
        unusedIndexes: unusedIndexes,
        inefficientIndexes: inefficientIndexes,
        usagePatterns: usagePatterns,
        averageSelectivity: this.calculateAverageIndexSelectivity(indexAnalysisResults),
        totalIndexSizeBytes: indexAnalysisResults.reduce((total, index) => total + (index.indexSize || 0), 0)
      };

    } catch (error) {
      console.error('Error analyzing index usage:', error);
      throw error;
    }
  }

  async generateOptimizationRecommendations(slowQueries, queryPatterns, indexUsage) {
    console.log('Generating performance optimization recommendations...');

    try {
      const recommendations = [];

      // Analyze slow queries for index recommendations
      for (const slowQuery of slowQueries.slowQueryPatterns) {
        if (slowQuery.averageSelectivity < 0.1 && slowQuery.executionCount > this.config.minQueryExecutions) {
          recommendations.push({
            type: 'index_recommendation',
            priority: 'high',
            collection: slowQuery._id.collection,
            issue: 'Low selectivity query pattern with high execution frequency',
            recommendation: await this.generateIndexRecommendation(slowQuery),
            expectedImprovement: this.estimatePerformanceImprovement(slowQuery),
            implementationComplexity: 'medium',
            estimatedImpact: slowQuery.performanceImpact
          });
        }
      }

      // Analyze unused indexes
      for (const unusedIndex of indexUsage.unusedIndexes) {
        if (unusedIndex.indexName !== '_id_') { // Never recommend dropping _id index
          recommendations.push({
            type: 'index_cleanup',
            priority: 'medium',
            collection: unusedIndex.collection,
            issue: `Unused index consuming storage space: ${unusedIndex.indexName}`,
            recommendation: `Consider dropping unused index '${unusedIndex.indexName}' to save ${Math.round((unusedIndex.indexSize || 0) / 1024 / 1024)} MB storage`,
            expectedImprovement: {
              storageReduction: unusedIndex.indexSize || 0,
              maintenanceOverheadReduction: 'low'
            },
            implementationComplexity: 'low',
            estimatedImpact: unusedIndex.indexSize || 0
          });
        }
      }

      // Analyze compound index opportunities
      if (this.config.enableCompoundIndexOptimization) {
        const compoundIndexOpportunities = await this.analyzeCompoundIndexOpportunities(queryPatterns);
        recommendations.push(...compoundIndexOpportunities);
      }

      // Sort recommendations by priority and estimated impact
      recommendations.sort((a, b) => {
        const priorityOrder = { high: 3, medium: 2, low: 1 };
        const priorityDiff = priorityOrder[b.priority] - priorityOrder[a.priority];

        if (priorityDiff !== 0) return priorityDiff;
        return (b.estimatedImpact || 0) - (a.estimatedImpact || 0);
      });

      return recommendations.slice(0, 20); // Return top 20 recommendations

    } catch (error) {
      console.error('Error generating optimization recommendations:', error);
      return [];
    }
  }

  async generateIndexRecommendation(slowQuery) {
    try {
      // Analyze the query shape to determine optimal index structure
      const queryShape = JSON.parse(slowQuery._id.queryShape);
      const filterFields = Object.keys(queryShape.filter || {});
      const sortFields = queryShape.sort || [];

      let recommendedIndex = {};

      // Build compound index recommendation based on query patterns
      // Rule 1: Equality filters first
      filterFields.forEach(field => {
        if (queryShape.filter[field] === 'string' || queryShape.filter[field] === 'number') {
          recommendedIndex[field] = 1;
        }
      });

      // Rule 2: Range filters after equality filters
      filterFields.forEach(field => {
        if (queryShape.filter[field] === 'object') { // Likely range query
          if (!recommendedIndex[field]) {
            recommendedIndex[field] = 1;
          }
        }
      });

      // Rule 3: Sort fields last
      sortFields.forEach(field => {
        if (!recommendedIndex[field]) {
          recommendedIndex[field] = 1;
        }
      });

      return {
        suggestedIndex: recommendedIndex,
        indexCommand: `db.${slowQuery._id.collection}.createIndex(${JSON.stringify(recommendedIndex)})`,
        reasoning: `Compound index optimized for query pattern with ${filterFields.length} filter fields and ${sortFields.length} sort fields`,
        estimatedSize: this.estimateIndexSize(recommendedIndex, slowQuery._id.collection)
      };

    } catch (error) {
      console.error('Error generating index recommendation:', error);
      return {
        suggestedIndex: {},
        indexCommand: 'Manual analysis required',
        reasoning: 'Unable to analyze query pattern automatically',
        estimatedSize: 0
      };
    }
  }
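
  // Applying a generated recommendation programmatically (illustrative; review any
  // suggested index before creating it in production):
  //   const rec = await this.generateIndexRecommendation(slowQuery);
  //   await this.db.collection(slowQuery._id.collection).createIndex(rec.suggestedIndex);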

  async explainQuery(collection, query, options = {}) {
    console.log(`Explaining query execution plan for collection: ${collection}`);

    try {
      const targetCollection = this.db.collection(collection);

      // Execute explain with detailed execution stats
      const explainResult = await targetCollection
        .find(query.filter || {}, options)
        .sort(query.sort || {})
        .limit(query.limit || 0)
        .explain('executionStats');

      // Analyze execution plan
      const executionAnalysis = this.analyzeExecutionPlan(explainResult);

      // Generate optimization insights
      const optimizationInsights = await this.generateQueryOptimizationInsights(
        collection, 
        query, 
        explainResult, 
        executionAnalysis
      );

      return {
        success: true,
        query: query,
        executionPlan: explainResult,
        executionAnalysis: executionAnalysis,
        optimizationInsights: optimizationInsights,
        explainTimestamp: new Date()
      };

    } catch (error) {
      console.error(`Error explaining query for collection ${collection}:`, error);
      return {
        success: false,
        collection: collection,
        query: query,
        error: error.message
      };
    }
  }
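
  // Illustrative call (collection and field names are assumptions):
  //   const report = await optimizer.explainQuery('orders', {
  //     filter: { status: 'shipped', customerId: 12345 },
  //     sort: { createdAt: -1 },
  //     limit: 20
  //   });
  //   report.executionAnalysis.performanceRating; // 'excellent' | 'good' | 'fair' | 'poor'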

  analyzeExecutionPlan(explainResult) {
    try {
      const executionStats = explainResult.executionStats;
      const winningPlan = explainResult.queryPlanner?.winningPlan;

      const analysis = {
        // Basic execution metrics (executionStats reports returned documents as nReturned)
        executionTime: executionStats.executionTimeMillis,
        documentsExamined: executionStats.totalDocsExamined,
        documentsReturned: executionStats.nReturned,
        keysExamined: executionStats.totalKeysExamined,

        // Efficiency calculations
        selectivityRatio: executionStats.totalDocsExamined > 0 
          ? executionStats.nReturned / executionStats.totalDocsExamined 
          : 0,

        indexEfficiency: executionStats.totalKeysExamined > 0 
          ? executionStats.nReturned / executionStats.totalKeysExamined 
          : 0,

        // Plan analysis
        planType: this.identifyPlanType(winningPlan),
        indexesUsed: this.extractIndexesUsed(winningPlan),
        hasSort: this.hasSortStage(winningPlan),
        hasBlockingSort: this.hasBlockingSortStage(winningPlan),

        // Performance assessment
        performanceRating: this.assessQueryPerformance(executionStats, winningPlan),

        // Resource usage
        workingSetSize: executionStats.workingSetSize || 0,

        // Optimization opportunities
        needsOptimization: this.needsOptimization(executionStats, winningPlan)
      };

      return analysis;

    } catch (error) {
      console.error('Error analyzing execution plan:', error);
      return {
        error: 'Failed to analyze execution plan',
        executionTime: 0,
        documentsExamined: 0,
        documentsReturned: 0,
        needsOptimization: true
      };
    }
  }

  async generateQueryOptimizationInsights(collection, query, explainResult, executionAnalysis) {
    try {
      const insights = [];

      // Check for full collection scans
      if (executionAnalysis.planType === 'COLLSCAN') {
        insights.push({
          type: 'full_scan_detected',
          severity: 'high',
          message: 'Query is performing a full collection scan',
          recommendation: 'Add an appropriate index to avoid collection scanning',
          suggestedIndex: await this.suggestIndexForQuery(query)
        });
      }

      // Check for low selectivity
      if (executionAnalysis.selectivityRatio < 0.1) {
        insights.push({
          type: 'low_selectivity',
          severity: 'medium',
          message: `Query selectivity is low (${(executionAnalysis.selectivityRatio * 100).toFixed(2)}%)`,
          recommendation: 'Consider more selective query conditions or compound indexes',
          currentSelectivity: executionAnalysis.selectivityRatio
        });
      }

      // Check for blocking sorts
      if (executionAnalysis.hasBlockingSort) {
        insights.push({
          type: 'blocking_sort',
          severity: 'high',
          message: 'Query requires in-memory sorting which can be expensive',
          recommendation: 'Create an index that supports the sort order',
          suggestedIndex: this.suggestSortIndex(query.sort)
        });
      }

      // Check for excessive key examination
      if (executionAnalysis.keysExamined > executionAnalysis.documentsReturned * 10) {
        insights.push({
          type: 'excessive_key_examination',
          severity: 'medium',
          message: 'Query is examining significantly more keys than documents returned',
          recommendation: 'Consider compound indexes to improve key examination efficiency',
          keysExamined: executionAnalysis.keysExamined,
          documentsReturned: executionAnalysis.documentsReturned
        });
      }

      // Check execution time
      if (executionAnalysis.executionTime > this.config.slowQueryThresholdMs) {
        insights.push({
          type: 'slow_execution',
          severity: executionAnalysis.executionTime > this.config.slowQueryThresholdMs * 5 ? 'high' : 'medium',
          message: `Query execution time (${executionAnalysis.executionTime}ms) exceeds threshold`,
          recommendation: 'Consider query optimization or index improvements',
          executionTime: executionAnalysis.executionTime,
          threshold: this.config.slowQueryThresholdMs
        });
      }

      return insights;

    } catch (error) {
      console.error('Error generating query optimization insights:', error);
      return [];
    }
  }

  async getPerformanceMetrics(timeRangeHours = 24) {
    console.log(`Retrieving performance metrics for the last ${timeRangeHours} hours...`);

    try {
      const startTime = new Date(Date.now() - (timeRangeHours * 60 * 60 * 1000));

      // Get comprehensive performance metrics
      const metrics = await this.collections.performanceMetrics
        .find({
          timestamp: { $gte: startTime }
        })
        .sort({ timestamp: -1 })
        .toArray();

      // Calculate summary statistics
      const performanceSummary = this.calculatePerformanceSummary(metrics);

      // Get current optimization recommendations
      const currentRecommendations = await this.collections.optimizationRecommendations
        .find({
          createdAt: { $gte: startTime },
          status: { $ne: 'implemented' }
        })
        .sort({ priority: -1, estimatedImpact: -1 })
        .limit(10)
        .toArray();

      return {
        success: true,
        timeRangeHours: timeRangeHours,
        metricsCollected: metrics.length,
        performanceSummary: performanceSummary,
        currentRecommendations: currentRecommendations,
        lastUpdated: new Date()
      };

    } catch (error) {
      console.error('Error retrieving performance metrics:', error);
      return {
        success: false,
        error: error.message,
        timeRangeHours: timeRangeHours
      };
    }
  }

  calculatePerformanceSummary(metrics) {
    if (metrics.length === 0) {
      return {
        totalQueries: 0,
        averageExecutionTime: 0,
        slowQueries: 0,
        indexEffectiveness: 'unknown'
      };
    }

    // Extract metrics from analysis results
    const analysisResults = metrics
      .filter(metric => metric.metricType === 'comprehensive_analysis')
      .map(metric => metric.analysisResults);

    if (analysisResults.length === 0) {
      return {
        totalQueries: 0,
        averageExecutionTime: 0,
        slowQueries: 0,
        indexEffectiveness: 'no_data'
      };
    }

    const latestAnalysis = analysisResults[0];

    return {
      totalQueries: latestAnalysis.queryPerformanceSummary?.totalQueries || 0,
      averageExecutionTime: latestAnalysis.queryPerformanceSummary?.averageExecutionTime || 0,
      p95ExecutionTime: latestAnalysis.queryPerformanceSummary?.p95ExecutionTime || 0,
      slowQueries: latestAnalysis.queryPerformanceSummary?.slowQueries || 0,

      // Index effectiveness
      indexEffectiveness: {
        totalIndexes: latestAnalysis.indexEffectiveness?.totalIndexes || 0,
        activelyUsedIndexes: latestAnalysis.indexEffectiveness?.activelyUsedIndexes || 0,
        unusedIndexes: latestAnalysis.indexEffectiveness?.unusedIndexes?.length || 0,
        averageSelectivity: latestAnalysis.indexEffectiveness?.averageIndexSelectivity || 0
      },

      // Performance trends
      performanceBottlenecks: latestAnalysis.performanceBottlenecks || [],
      optimizationOpportunities: latestAnalysis.optimizationOpportunities?.length || 0
    };
  }

  // Utility methods for performance analysis

  determineIndexType(index) {
    if (index.name === '_id_') return 'primary';
    if (index.unique) return 'unique';
    if (index.sparse) return 'sparse';
    if (index.partialFilterExpression) return 'partial';
    if (Object.values(index.key).includes('text')) return 'text';
    if (Object.values(index.key).includes('2dsphere')) return 'geospatial';
    if (Object.keys(index.key).length > 1) return 'compound';
    return 'single';
  }

  calculateIndexEffectiveness(indexStat, index) {
    const usageCount = indexStat.accesses?.ops || 0;
    const indexSize = index.size || 0;

    // Calculate effectiveness based on usage frequency and size efficiency
    if (usageCount === 0) return 0;
    if (indexSize === 0) return 1;

    // Simple effectiveness metric: usage per MB of index size
    const sizeInMB = indexSize / (1024 * 1024);
    return Math.min(usageCount / Math.max(sizeInMB, 1), 100);
  }

  assessIndexHealth(indexStat, index) {
    const usageCount = indexStat.accesses?.ops || 0;
    const effectiveness = this.calculateIndexEffectiveness(indexStat, index);

    if (usageCount === 0) return 'unused';
    if (effectiveness < 0.1) return 'inefficient';
    if (effectiveness > 10) return 'highly_effective';
    return 'moderate';
  }

  identifyPlanType(winningPlan) {
    if (!winningPlan) return 'unknown';
    if (winningPlan.stage === 'COLLSCAN') return 'COLLSCAN';
    if (winningPlan.stage === 'IXSCAN') return 'IXSCAN';
    if (winningPlan.inputStage?.stage === 'IXSCAN') return 'IXSCAN';
    return winningPlan.stage || 'unknown';
  }

  extractIndexesUsed(winningPlan) {
    const indexes = [];

    function extractFromStage(stage) {
      if (stage.indexName) {
        indexes.push(stage.indexName);
      }
      if (stage.inputStage) {
        extractFromStage(stage.inputStage);
      }
      if (stage.inputStages) {
        stage.inputStages.forEach(extractFromStage);
      }
    }

    if (winningPlan) {
      extractFromStage(winningPlan);
    }

    return [...new Set(indexes)]; // Remove duplicates
  }

  hasSortStage(winningPlan) {
    if (!winningPlan) return false;

    function checkForSort(stage) {
      if (stage.stage === 'SORT') return true;
      if (stage.inputStage) return checkForSort(stage.inputStage);
      if (stage.inputStages) return stage.inputStages.some(checkForSort);
      return false;
    }

    return checkForSort(winningPlan);
  }

  hasBlockingSortStage(winningPlan) {
    if (!winningPlan) return false;

    function checkForBlockingSort(stage) {
      // A sort is blocking if it's not supported by an index
      if (stage.stage === 'SORT' && !stage.inputStage?.stage?.includes('IXSCAN')) {
        return true;
      }
      if (stage.inputStage) return checkForBlockingSort(stage.inputStage);
      if (stage.inputStages) return stage.inputStages.some(checkForBlockingSort);
      return false;
    }

    return checkForBlockingSort(winningPlan);
  }

  assessQueryPerformance(executionStats, winningPlan) {
    const executionTime = executionStats.executionTimeMillis || 0;
    const selectivityRatio = executionStats.totalDocsExamined > 0 
      ? executionStats.nReturned / executionStats.totalDocsExamined 
      : 0;

    // Performance rating based on multiple factors
    let score = 100;

    // Penalize slow execution
    if (executionTime > 1000) score -= 40;
    else if (executionTime > 500) score -= 20;
    else if (executionTime > 100) score -= 10;

    // Penalize low selectivity
    if (selectivityRatio < 0.01) score -= 30;
    else if (selectivityRatio < 0.1) score -= 15;

    // Penalize full collection scans
    if (winningPlan?.stage === 'COLLSCAN') score -= 25;

    // Penalize blocking sorts
    if (this.hasBlockingSortStage(winningPlan)) score -= 15;

    if (score >= 80) return 'excellent';
    if (score >= 60) return 'good';
    if (score >= 40) return 'fair';
    return 'poor';
  }

  needsOptimization(executionStats, winningPlan) {
    const executionTime = executionStats.executionTimeMillis || 0;
    const selectivityRatio = executionStats.totalDocsExamined > 0 
      ? executionStats.nReturned / executionStats.totalDocsExamined 
      : 0;

    return executionTime > this.config.slowQueryThresholdMs ||
           selectivityRatio < 0.1 ||
           winningPlan?.stage === 'COLLSCAN' ||
           this.hasBlockingSortStage(winningPlan);
  }

  estimatePerformanceImprovement(slowQuery) {
    return {
      executionTimeReduction: '60-80%',
      documentExaminationReduction: '90-95%',
      resourceUsageReduction: '70-85%',
      confidenceLevel: 'high'
    };
  }

  estimateIndexSize(indexKeys, collection) {
    // Simplified index size estimation
    const keyCount = Object.keys(indexKeys).length;
    const estimatedDocumentSize = 100; // Average document size estimate
    const estimatedCollectionSize = 100000; // Estimate

    return keyCount * estimatedDocumentSize * estimatedCollectionSize * 0.1;
  }

  async shutdown() {
    console.log('Shutting down performance optimizer...');

    try {
      // Disable profiling
      if (this.config.enableQueryProfiling) {
        await this.db.command({ profile: 0 });
      }

      // Close MongoDB connection
      if (this.client) {
        await this.client.close();
      }

      console.log('Performance optimizer shutdown complete');

    } catch (error) {
      console.error('Error during shutdown:', error);
    }
  }

  // Additional methods would include implementations for:
  // - startRealTimeMonitoring()
  // - initializeIndexAnalysis()
  // - identifyPerformanceBottlenecks()
  // - analyzeIndexUsagePatterns()
  // - calculateAverageIndexSelectivity()
  // - analyzeCompoundIndexOpportunities()
  // - identifyIndexOptimizations()
  // - suggestIndexForQuery()
  // - suggestSortIndex()
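
  // Minimal sketches of two of the helpers listed above; these are illustrative
  // assumptions rather than the full implementations.

  // Suggest an index whose key order and direction match a sort specification
  // so the sort can be satisfied by the index instead of an in-memory SORT stage.
  suggestSortIndex(sortSpec = {}) {
    const indexKeys = {};
    for (const [field, direction] of Object.entries(sortSpec)) {
      indexKeys[field] = direction >= 0 ? 1 : -1;
    }
    return Object.keys(indexKeys).length > 0 ? indexKeys : null;
  }

  // Suggest a simple index from a query's filter and sort: filter fields first
  // (ascending), then sort fields appended with their sort direction.
  async suggestIndexForQuery(query = {}) {
    const indexKeys = {};
    Object.keys(query.filter || {}).forEach(field => {
      indexKeys[field] = 1;
    });
    for (const [field, direction] of Object.entries(query.sort || {})) {
      if (!(field in indexKeys)) {
        indexKeys[field] = direction >= 0 ? 1 : -1;
      }
    }
    return Object.keys(indexKeys).length > 0 ? indexKeys : null;
  }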
}

// Benefits of MongoDB Advanced Performance Optimization:
// - Comprehensive query performance analysis and monitoring
// - Intelligent index optimization recommendations
// - Real-time performance bottleneck identification
// - Advanced execution plan analysis and insights
// - Automated slow query detection and optimization
// - Index usage effectiveness assessment
// - Compound index optimization strategies
// - SQL-compatible performance operations through QueryLeaf integration
// - Production-ready monitoring and alerting capabilities
// - Enterprise-grade performance tuning automation

module.exports = {
  AdvancedPerformanceOptimizer
};
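
For reference, the following is a minimal usage sketch of the optimizer above. It assumes the class is in scope (or required from the module that exports it); the connection string, database name, and option values are illustrative, and a production service would gate analysis on initialization completing rather than using a fixed delay.

// Illustrative usage of AdvancedPerformanceOptimizer (connection string and options are assumptions)
const optimizer = new AdvancedPerformanceOptimizer('mongodb://localhost:27017/appdb', {
  slowQueryThresholdMs: 100,
  enableCompoundIndexOptimization: true
});

// React to completed analyses
optimizer.on('performanceAnalysisCompleted', insights => {
  console.log(`Slow query patterns: ${insights.queryPerformanceSummary.slowQueries}`);
  console.log(`Optimization opportunities: ${insights.optimizationOpportunities.length}`);
});

// The constructor starts initialization asynchronously, so wait briefly before analyzing
setTimeout(async () => {
  const result = await optimizer.analyzeQueryPerformance(24);
  if (result.success) {
    for (const rec of result.analysisResults.optimizationOpportunities.slice(0, 5)) {
      console.log(`[${rec.priority}] ${rec.collection}: ${rec.issue}`);
    }
  }
  await optimizer.shutdown();
}, 5000);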

Understanding MongoDB Performance Architecture

Advanced Query Optimization and Index Management Patterns

Implement sophisticated performance optimization workflows for enterprise MongoDB deployments:

// Enterprise-grade performance optimization with advanced analytics capabilities
class EnterprisePerformanceManager extends AdvancedPerformanceOptimizer {
  constructor(mongoUri, enterpriseConfig) {
    super(mongoUri, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enablePredictiveOptimization: true,
      enableCapacityPlanning: true,
      enableAutomatedTuning: true,
      enablePerformanceForecasting: true,
      enableComplianceReporting: true
    };

    this.setupEnterpriseCapabilities();
    this.initializePredictiveAnalytics();
    this.setupAutomatedOptimization();
  }
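
  // Note: setupEnterpriseCapabilities(), initializePredictiveAnalytics(),
  // setupAutomatedOptimization(), and deployOptimizationStrategy() are assumed
  // enterprise-specific hooks; their implementations depend on the deployment
  // environment and are not shown here.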

  async implementAdvancedOptimizationStrategy() {
    console.log('Implementing enterprise optimization strategy...');

    const optimizationStrategy = {
      // Multi-tier optimization approach
      optimizationTiers: {
        realTimeOptimization: {
          enabled: true,
          responseTimeThreshold: 100,
          automaticIndexCreation: true,
          queryRewriting: true
        },
        batchOptimization: {
          enabled: true,
          analysisInterval: '1h',
          comprehensiveIndexAnalysis: true,
          workloadPatternAnalysis: true
        },
        predictiveOptimization: {
          enabled: true,
          forecastingHorizon: '7d',
          capacityPlanning: true,
          performanceTrendAnalysis: true
        }
      },

      // Advanced analytics
      performanceAnalytics: {
        enableMachineLearning: true,
        anomalyDetection: true,
        performanceForecasting: true,
        workloadCharacterization: true
      }
    };

    return await this.deployOptimizationStrategy(optimizationStrategy);
  }
}

SQL-Style Performance Optimization with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB performance analysis and optimization:

-- QueryLeaf advanced performance optimization with SQL-familiar syntax for MongoDB

-- Comprehensive query performance analysis
WITH query_performance_analysis AS (
    SELECT 
        collection_name,
        query_shape_hash,
        query_pattern_type,

        -- Execution statistics
        COUNT(*) as execution_count,
        AVG(execution_time_ms) as avg_execution_time_ms,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY execution_time_ms) as p95_execution_time_ms,
        MAX(execution_time_ms) as max_execution_time_ms,

        -- Document examination analysis
        AVG(documents_examined) as avg_docs_examined,
        AVG(documents_returned) as avg_docs_returned,
        CASE 
            WHEN AVG(documents_examined) > 0 THEN
                AVG(documents_returned) / AVG(documents_examined)
            ELSE 0
        END as avg_selectivity_ratio,

        -- Index usage analysis
        STRING_AGG(DISTINCT index_name, ', ') as indexes_used,
        AVG(keys_examined) as avg_keys_examined,

        -- Resource utilization
        SUM(execution_time_ms) as total_execution_time_ms,
        AVG(working_set_size_kb) as avg_working_set_kb,

        -- Performance categorization
        CASE 
            WHEN AVG(execution_time_ms) < 50 THEN 'fast'
            WHEN AVG(execution_time_ms) < 200 THEN 'moderate' 
            WHEN AVG(execution_time_ms) < 1000 THEN 'slow'
            ELSE 'very_slow'
        END as performance_category,

        -- Optimization need assessment
        CASE 
            WHEN AVG(execution_time_ms) > 500 OR 
                 (AVG(documents_examined) > AVG(documents_returned) * 100) OR
                 COUNT(*) > 1000 THEN true
            ELSE false
        END as needs_optimization

    FROM QUERY_PERFORMANCE_LOG
    WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY collection_name, query_shape_hash, query_pattern_type
),

index_effectiveness_analysis AS (
    SELECT 
        collection_name,
        index_name,
        index_type,
        COALESCE(JSON_EXTRACT(index_definition, '$'), '{}') as index_keys,

        -- Usage statistics
        COALESCE(usage_count, 0) as usage_count,
        COALESCE(last_used_timestamp, '1970-01-01'::timestamp) as last_used,

        -- Size and storage analysis
        index_size_bytes,
        ROUND(index_size_bytes / 1024.0 / 1024.0, 2) as index_size_mb,

        -- Effectiveness calculations
        CASE 
            WHEN usage_count = 0 THEN 0
            WHEN index_size_bytes > 0 THEN 
                usage_count / GREATEST((index_size_bytes / 1024.0 / 1024.0), 1)
            ELSE usage_count
        END as effectiveness_score,

        -- Usage categorization
        CASE 
            WHEN usage_count = 0 THEN 'unused'
            WHEN usage_count < 100 THEN 'rarely_used'
            WHEN usage_count < 1000 THEN 'moderately_used'
            ELSE 'frequently_used'
        END as usage_category,

        -- Health assessment
        CASE 
            WHEN usage_count = 0 AND index_name != '_id_' THEN 'candidate_for_removal'
            WHEN usage_count > 0 AND index_size_bytes > 100*1024*1024 AND usage_count < 100 THEN 'review_necessity'
            WHEN usage_count > 1000 THEN 'valuable'
            ELSE 'monitor'
        END as health_status,

        -- Age analysis
        EXTRACT(DAYS FROM (CURRENT_TIMESTAMP - COALESCE(last_used_timestamp, created_timestamp))) as days_since_last_use

    FROM INDEX_USAGE_STATS
    WHERE analysis_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
),

optimization_opportunities AS (
    SELECT 
        qpa.collection_name,
        qpa.query_pattern_type,
        qpa.execution_count,
        qpa.avg_execution_time_ms,
        qpa.avg_selectivity_ratio,
        qpa.performance_category,
        qpa.needs_optimization,

        -- Performance impact calculation
        qpa.total_execution_time_ms as performance_impact_ms,
        ROUND(qpa.total_execution_time_ms / 1000.0, 2) as performance_impact_seconds,

        -- Index analysis correlation
        COUNT(iea.index_name) as available_indexes,
        STRING_AGG(iea.index_name, ', ') as collection_indexes,
        AVG(iea.effectiveness_score) as avg_index_effectiveness,

        -- Optimization recommendations
        CASE 
            WHEN qpa.avg_selectivity_ratio < 0.01 AND qpa.execution_count > 100 THEN 'create_selective_index'
            WHEN qpa.avg_execution_time_ms > 1000 AND qpa.indexes_used IS NULL THEN 'add_supporting_index'
            WHEN qpa.avg_execution_time_ms > 500 AND qpa.indexes_used LIKE '%COLLSCAN%' THEN 'replace_collection_scan'
            WHEN qpa.performance_category = 'very_slow' THEN 'comprehensive_optimization'
            WHEN qpa.execution_count > 10000 AND qpa.performance_category IN ('slow', 'moderate') THEN 'high_frequency_optimization'
            ELSE 'monitor_performance'
        END as optimization_recommendation,

        -- Priority assessment
        CASE 
            WHEN qpa.total_execution_time_ms > 60000 AND qpa.execution_count > 1000 THEN 'critical'
            WHEN qpa.total_execution_time_ms > 30000 OR qpa.avg_execution_time_ms > 2000 THEN 'high'
            WHEN qpa.total_execution_time_ms > 10000 OR qpa.execution_count > 5000 THEN 'medium'
            ELSE 'low'
        END as optimization_priority,

        -- Estimated improvement potential
        CASE 
            WHEN qpa.avg_selectivity_ratio < 0.01 THEN '80-90% improvement expected'
            WHEN qpa.performance_category = 'very_slow' THEN '60-80% improvement expected'
            WHEN qpa.performance_category = 'slow' THEN '40-60% improvement expected'
            ELSE '20-40% improvement expected'
        END as estimated_improvement

    FROM query_performance_analysis qpa
    LEFT JOIN index_effectiveness_analysis iea ON qpa.collection_name = iea.collection_name
    WHERE qpa.needs_optimization = true
    GROUP BY 
        qpa.collection_name, qpa.query_pattern_type, qpa.execution_count,
        qpa.avg_execution_time_ms, qpa.avg_selectivity_ratio, qpa.performance_category,
        qpa.needs_optimization, qpa.total_execution_time_ms, qpa.indexes_used
)

SELECT 
    oo.collection_name,
    oo.query_pattern_type,
    oo.optimization_priority,
    oo.optimization_recommendation,

    -- Performance metrics
    oo.execution_count,
    ROUND(oo.avg_execution_time_ms, 2) as avg_execution_time_ms,
    ROUND(oo.performance_impact_seconds, 2) as total_impact_seconds,
    ROUND(oo.avg_selectivity_ratio * 100, 2) as selectivity_percent,

    -- Current state analysis
    oo.performance_category,
    oo.available_indexes,
    COALESCE(oo.collection_indexes, 'No indexes found') as current_indexes,
    ROUND(COALESCE(oo.avg_index_effectiveness, 0), 2) as avg_index_effectiveness,

    -- Optimization guidance
    oo.estimated_improvement,

    -- Specific recommendations based on analysis
    CASE oo.optimization_recommendation
        WHEN 'create_selective_index' THEN 
            'Create compound index on high-selectivity filter fields for collection: ' || oo.collection_name
        WHEN 'add_supporting_index' THEN 
            'Add index to eliminate collection scans in collection: ' || oo.collection_name
        WHEN 'replace_collection_scan' THEN 
            'Critical: Replace collection scan with indexed access in collection: ' || oo.collection_name
        WHEN 'comprehensive_optimization' THEN 
            'Comprehensive query and index optimization needed for collection: ' || oo.collection_name
        WHEN 'high_frequency_optimization' THEN 
            'Optimize high-frequency queries in collection: ' || oo.collection_name
        ELSE 'Continue monitoring performance trends'
    END as detailed_recommendation,

    -- Implementation complexity assessment
    CASE 
        WHEN oo.available_indexes = 0 THEN 'high_complexity'
        WHEN oo.avg_index_effectiveness < 1 THEN 'medium_complexity'
        ELSE 'low_complexity'
    END as implementation_complexity,

    -- Business impact estimation
    CASE oo.optimization_priority
        WHEN 'critical' THEN 'High business impact - immediate attention required'
        WHEN 'high' THEN 'Moderate business impact - optimize within 1 week'
        WHEN 'medium' THEN 'Low business impact - optimize within 1 month'
        ELSE 'Minimal business impact - optimize when convenient'
    END as business_impact_assessment,

    -- Resource requirements
    CASE 
        WHEN oo.optimization_recommendation IN ('create_selective_index', 'add_supporting_index') THEN 'Index creation: 5-30 minutes'
        WHEN oo.optimization_recommendation = 'comprehensive_optimization' THEN 'Full analysis: 2-8 hours'
        ELSE 'Monitoring: ongoing'
    END as estimated_effort

FROM optimization_opportunities oo
ORDER BY 
    CASE oo.optimization_priority 
        WHEN 'critical' THEN 1 
        WHEN 'high' THEN 2 
        WHEN 'medium' THEN 3 
        ELSE 4 
    END,
    oo.performance_impact_seconds DESC,
    oo.execution_count DESC;

-- Index usage and effectiveness analysis
WITH index_usage_trends AS (
    SELECT 
        collection_name,
        index_name,

        -- Usage trend analysis over time windows
        SUM(CASE WHEN analysis_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour' THEN usage_count ELSE 0 END) as usage_last_hour,
        SUM(CASE WHEN analysis_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours' THEN usage_count ELSE 0 END) as usage_last_24h,
        SUM(CASE WHEN analysis_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days' THEN usage_count ELSE 0 END) as usage_last_7d,

        -- Size and storage trends
        AVG(index_size_bytes) as avg_index_size_bytes,
        MAX(index_size_bytes) as max_index_size_bytes,

        -- Usage efficiency trends
        AVG(CASE WHEN index_size_bytes > 0 AND usage_count > 0 THEN 
                usage_count / (index_size_bytes / 1024.0 / 1024.0)
            ELSE 0 
        END) as avg_usage_efficiency,

        -- Consistency analysis
        COUNT(DISTINCT DATE_TRUNC('day', analysis_timestamp)) as analysis_days,
        STDDEV(usage_count) as usage_variability,

        -- Most recent statistics
        MAX(analysis_timestamp) as last_analysis,
        MAX(last_used_timestamp) as most_recent_use

    FROM index_usage_stats
    WHERE analysis_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
    GROUP BY collection_name, index_name
),

index_recommendations AS (
    SELECT 
        iut.*,

        -- Usage trend classification
        CASE 
            WHEN iut.usage_last_hour = 0 AND iut.usage_last_24h = 0 AND iut.usage_last_7d = 0 THEN 'completely_unused'
            WHEN iut.usage_last_hour = 0 AND iut.usage_last_24h = 0 AND iut.usage_last_7d > 0 THEN 'infrequently_used'
            WHEN iut.usage_last_hour = 0 AND iut.usage_last_24h > 0 THEN 'daily_usage'
            WHEN iut.usage_last_hour > 0 THEN 'active_usage'
            ELSE 'unknown_usage'
        END as usage_trend,

        -- Storage efficiency assessment
        CASE 
            WHEN iut.avg_index_size_bytes > 1024*1024*1024 AND iut.usage_last_7d < 100 THEN 'storage_inefficient'
            WHEN iut.avg_index_size_bytes > 100*1024*1024 AND iut.usage_last_7d < 10 THEN 'questionable_storage_usage'
            WHEN iut.avg_usage_efficiency > 10 THEN 'storage_efficient'
            ELSE 'acceptable_storage_usage'
        END as storage_efficiency,

        -- Recommendation generation
        CASE 
            WHEN iut.usage_last_7d = 0 AND iut.index_name != '_id_' THEN 'consider_dropping'
            WHEN iut.avg_index_size_bytes > 500*1024*1024 AND iut.usage_last_7d < 50 THEN 'evaluate_necessity'
            WHEN iut.usage_variability > iut.usage_last_7d * 0.8 THEN 'inconsistent_usage_investigate'
            WHEN iut.avg_usage_efficiency > 20 THEN 'high_value_maintain'
            WHEN iut.usage_last_hour > 100 THEN 'critical_index_monitor'
            ELSE 'continue_monitoring'
        END as recommendation,

        -- Impact assessment for potential changes
        CASE 
            WHEN iut.usage_last_hour > 0 THEN 'high_impact_if_removed'
            WHEN iut.usage_last_24h > 0 THEN 'medium_impact_if_removed'
            WHEN iut.usage_last_7d > 0 THEN 'low_impact_if_removed'
            ELSE 'no_impact_if_removed'
        END as removal_impact,

        -- Storage savings potential
        CASE 
            WHEN iut.avg_index_size_bytes > 0 THEN 
                ROUND(iut.avg_index_size_bytes / 1024.0 / 1024.0, 2)
            ELSE 0
        END as storage_savings_mb

    FROM index_usage_trends iut
),

collection_performance_summary AS (
    SELECT 
        collection_name,
        COUNT(*) as total_indexes,

        -- Usage distribution
        COUNT(*) FILTER (WHERE usage_trend = 'active_usage') as active_indexes,
        COUNT(*) FILTER (WHERE usage_trend = 'daily_usage') as daily_indexes,
        COUNT(*) FILTER (WHERE usage_trend = 'infrequently_used') as infrequent_indexes,
        COUNT(*) FILTER (WHERE usage_trend = 'completely_unused') as unused_indexes,

        -- Storage analysis
        SUM(avg_index_size_bytes) as total_index_storage_bytes,
        AVG(avg_usage_efficiency) as collection_avg_efficiency,

        -- Optimization potential
        COUNT(*) FILTER (WHERE recommendation = 'consider_dropping') as indexes_to_drop,
        COUNT(*) FILTER (WHERE recommendation = 'evaluate_necessity') as indexes_to_evaluate,
        SUM(CASE WHEN recommendation IN ('consider_dropping', 'evaluate_necessity') 
                 THEN storage_savings_mb ELSE 0 END) as potential_storage_savings_mb,

        -- Collection health assessment
        CASE 
            WHEN COUNT(*) FILTER (WHERE usage_trend = 'active_usage') = 0 THEN 'no_active_indexes'
            WHEN COUNT(*) FILTER (WHERE usage_trend = 'completely_unused') > COUNT(*) * 0.5 THEN 'many_unused_indexes'
            WHEN AVG(avg_usage_efficiency) < 1 THEN 'poor_index_efficiency'
            ELSE 'healthy_index_usage'
        END as collection_health

    FROM index_recommendations
    GROUP BY collection_name
)

SELECT 
    cps.collection_name,
    cps.total_indexes,
    cps.collection_health,

    -- Index usage distribution
    cps.active_indexes,
    cps.daily_indexes,
    cps.infrequent_indexes,
    cps.unused_indexes,

    -- Storage utilization
    ROUND(cps.total_index_storage_bytes / 1024.0 / 1024.0, 2) as total_storage_mb,
    ROUND(cps.collection_avg_efficiency, 2) as avg_efficiency_score,

    -- Optimization opportunities
    cps.indexes_to_drop,
    cps.indexes_to_evaluate, 
    ROUND(cps.potential_storage_savings_mb, 2) as potential_savings_mb,

    -- Optimization priority
    CASE 
        WHEN cps.collection_health = 'no_active_indexes' THEN 'critical_review_needed'
        WHEN cps.unused_indexes > 5 OR cps.potential_storage_savings_mb > 1000 THEN 'high_cleanup_priority'
        WHEN cps.collection_avg_efficiency < 2 THEN 'medium_optimization_priority'
        ELSE 'low_maintenance_priority'
    END as optimization_priority,

    -- Recommendations summary
    CASE cps.collection_health
        WHEN 'no_active_indexes' THEN 'URGENT: Collection has no actively used indexes - investigate query patterns'
        WHEN 'many_unused_indexes' THEN 'Multiple unused indexes detected - perform index cleanup'
        WHEN 'poor_index_efficiency' THEN 'Index usage is inefficient - review index design'
        ELSE 'Index usage appears healthy - continue monitoring'
    END as primary_recommendation,

    -- Storage efficiency assessment
    CASE 
        WHEN cps.potential_storage_savings_mb > 1000 THEN 
            'High storage optimization potential: ' || ROUND(cps.potential_storage_savings_mb, 0) || 'MB recoverable'
        WHEN cps.potential_storage_savings_mb > 100 THEN 
            'Moderate storage optimization: ' || ROUND(cps.potential_storage_savings_mb, 0) || 'MB recoverable'
        WHEN cps.potential_storage_savings_mb > 10 THEN 
            'Minor storage optimization: ' || ROUND(cps.potential_storage_savings_mb, 0) || 'MB recoverable'
        ELSE 'Minimal storage optimization potential'
    END as storage_optimization_summary,

    -- Specific next actions (NULL entries from non-matching cases are removed)
    array_remove(ARRAY[
        CASE WHEN cps.indexes_to_drop > 0 THEN 
            'Review and drop ' || cps.indexes_to_drop || ' unused indexes' END,
        CASE WHEN cps.indexes_to_evaluate > 0 THEN 
            'Evaluate necessity of ' || cps.indexes_to_evaluate || ' underutilized indexes' END,
        CASE WHEN cps.collection_avg_efficiency < 1 THEN 
            'Redesign indexes for better efficiency' END,
        CASE WHEN cps.active_indexes = 0 THEN 
            'Investigate why no indexes are actively used' END
    ]::TEXT[], NULL) as action_items

FROM collection_performance_summary cps
ORDER BY 
    CASE cps.collection_health 
        WHEN 'no_active_indexes' THEN 1 
        WHEN 'many_unused_indexes' THEN 2 
        WHEN 'poor_index_efficiency' THEN 3 
        ELSE 4 
    END,
    cps.potential_storage_savings_mb DESC,
    cps.total_indexes DESC;

-- Real-time query performance monitoring and alerting
CREATE VIEW real_time_performance_dashboard AS
WITH current_performance AS (
    SELECT 
        collection_name,
        query_pattern_type,

        -- Recent performance metrics (last hour)
        COUNT(*) FILTER (WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as queries_last_hour,
        AVG(execution_time_ms) FILTER (WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as avg_time_last_hour,
        MAX(execution_time_ms) FILTER (WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as max_time_last_hour,

        -- Performance trend comparison (current hour vs previous hour)
        AVG(execution_time_ms) FILTER (WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
                                              AND execution_timestamp < CURRENT_TIMESTAMP - INTERVAL '1 hour') as avg_time_prev_hour,

        -- Critical performance indicators
        COUNT(*) FILTER (WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
                               AND execution_time_ms > 5000) as critical_slow_queries,
        COUNT(*) FILTER (WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
                               AND execution_time_ms > 1000) as slow_queries,

        -- Resource utilization trends
        AVG(documents_examined) FILTER (WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as avg_docs_examined,
        AVG(documents_returned) FILTER (WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour') as avg_docs_returned,

        -- Most recent query information
        MAX(execution_timestamp) as last_execution,
        MAX(execution_time_ms) as recent_max_time

    FROM query_performance_log
    WHERE execution_timestamp >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
    GROUP BY collection_name, query_pattern_type
),

performance_alerts AS (
    SELECT 
        cp.*,

        -- Performance trend analysis
        CASE 
            -- Critical threshold is checked first so it is not masked by the degradation branch
            WHEN cp.critical_slow_queries > 0 THEN 'critical_performance_alert'
            WHEN cp.avg_time_last_hour > cp.avg_time_prev_hour * 2 THEN 'degradation_alert'
            WHEN cp.avg_time_last_hour > 2000 THEN 'slow_performance_alert'
            WHEN cp.queries_last_hour > 1000 AND cp.avg_time_last_hour > 500 THEN 'high_volume_slow_alert'
            ELSE 'normal'
        END as alert_level,

        -- Selectivity analysis
        CASE 
            WHEN cp.avg_docs_examined > 0 THEN cp.avg_docs_returned / cp.avg_docs_examined
            ELSE 1
        END as current_selectivity,

        -- Performance change calculation
        CASE 
            WHEN cp.avg_time_prev_hour > 0 THEN 
                ROUND(((cp.avg_time_last_hour - cp.avg_time_prev_hour) / cp.avg_time_prev_hour) * 100, 1)
            ELSE 0
        END as performance_change_percent,

        -- Alert priority
        CASE 
            WHEN cp.critical_slow_queries > 0 THEN 'critical'
            WHEN cp.avg_time_last_hour > cp.avg_time_prev_hour * 2 THEN 'high'
            WHEN cp.slow_queries > 10 THEN 'medium'
            ELSE 'low'
        END as alert_priority

    FROM current_performance cp
    WHERE cp.queries_last_hour > 0
)

SELECT 
    pa.collection_name,
    pa.query_pattern_type,
    pa.alert_level,
    pa.alert_priority,

    -- Current performance metrics
    pa.queries_last_hour,
    ROUND(pa.avg_time_last_hour, 2) as current_avg_time_ms,
    pa.max_time_last_hour,
    pa.recent_max_time,

    -- Performance comparison
    ROUND(COALESCE(pa.avg_time_prev_hour, 0), 2) as previous_avg_time_ms,
    pa.performance_change_percent || '%' as performance_change,

    -- Problem severity indicators
    pa.critical_slow_queries,
    pa.slow_queries,
    ROUND(pa.current_selectivity * 100, 2) as selectivity_percent,

    -- Alert messages
    CASE pa.alert_level
        WHEN 'critical_performance_alert' THEN 
            'CRITICAL: ' || pa.critical_slow_queries || ' queries exceeded 5 second threshold'
        WHEN 'degradation_alert' THEN 
            'WARNING: Performance degraded by ' || pa.performance_change_percent || '% from previous hour'
        WHEN 'slow_performance_alert' THEN 
            'WARNING: Average query time (' || ROUND(pa.avg_time_last_hour, 0) || 'ms) exceeds acceptable threshold'
        WHEN 'high_volume_slow_alert' THEN 
            'WARNING: High query volume (' || pa.queries_last_hour || ') with slow performance'
        ELSE 'No performance alerts'
    END as alert_message,

    -- Recommended actions
    CASE pa.alert_level
        WHEN 'critical_performance_alert' THEN 'Immediate investigation required - check for index issues or resource constraints'
        WHEN 'degradation_alert' THEN 'Investigate performance regression - check recent changes or resource utilization'
        WHEN 'slow_performance_alert' THEN 'Review query optimization opportunities and index effectiveness'
        WHEN 'high_volume_slow_alert' THEN 'Consider query optimization and capacity scaling'
        ELSE 'Continue monitoring'
    END as recommended_action,

    -- Urgency indicator
    CASE pa.alert_priority
        WHEN 'critical' THEN 'Immediate attention required (< 15 minutes)'
        WHEN 'high' THEN 'Urgent attention needed (< 1 hour)'
        WHEN 'medium' THEN 'Should be addressed within 4 hours'
        ELSE 'Monitor and address during normal maintenance'
    END as response_urgency,

    -- Last occurrence
    pa.last_execution,
    ROUND(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - pa.last_execution)) / 60) as minutes_since_last_query

FROM performance_alerts pa
WHERE pa.alert_level != 'normal'
ORDER BY 
    CASE pa.alert_priority 
        WHEN 'critical' THEN 1 
        WHEN 'high' THEN 2 
        WHEN 'medium' THEN 3 
        ELSE 4 
    END,
    pa.performance_change_percent DESC,
    pa.avg_time_last_hour DESC;

-- QueryLeaf provides comprehensive MongoDB performance optimization capabilities:
-- 1. Advanced query performance analysis with SQL-familiar syntax
-- 2. Comprehensive index usage monitoring and effectiveness analysis
-- 3. Real-time performance alerting and automated optimization recommendations
-- 4. Detailed execution plan analysis and optimization insights
-- 5. Index optimization strategies including compound index recommendations
-- 6. Performance trend analysis and predictive optimization
-- 7. Resource utilization monitoring and capacity planning
-- 8. Automated slow query detection and optimization guidance
-- 9. Enterprise-grade performance management with minimal configuration
-- 10. Production-ready monitoring and optimization automation
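
The index usage counters and the query_performance_log data modeled in the SQL views above correspond to facilities MongoDB exposes natively: the $indexStats aggregation stage reports per-index access counts, and the database profiler records slow operations. The following is a minimal sketch of both; the database and collection names are placeholders, and note that profiling is configured per database and is not available on mongos.

// Hedged sketch: native sources for the index-usage and slow-query data modeled above
const { MongoClient } = require('mongodb');

async function collectNativeMetrics(uri) {
  const client = new MongoClient(uri);
  try {
    const db = client.db('app');

    // Per-index access counters (the raw input behind index usage trend analysis)
    const indexStats = await db.collection('orders')
      .aggregate([{ $indexStats: {} }])
      .toArray();
    indexStats.forEach(s => console.log(s.name, s.accesses.ops, s.accesses.since));

    // Enable profiling for operations slower than 1000ms, then read recent slow entries
    await db.command({ profile: 1, slowms: 1000 });
    const oneHourAgo = new Date(Date.now() - 60 * 60 * 1000);
    const slowOps = await db.collection('system.profile')
      .find({ ts: { $gte: oneHourAgo }, millis: { $gte: 1000 } })
      .sort({ millis: -1 })
      .limit(20)
      .toArray();
    slowOps.forEach(op => console.log(op.op, op.ns, op.millis, op.planSummary));
  } finally {
    await client.close();
  }
}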

Best Practices for Production Performance Optimization

Index Strategy Design Principles

Essential principles for effective MongoDB index optimization in production:

  1. Compound Index Design: Create efficient compound indexes following the ESR rule (Equality, Sort, Range) for optimal query performance (see the sketch after this list)
  2. Index Usage Monitoring: Continuously monitor index usage patterns and effectiveness to identify optimization opportunities
  3. Query Pattern Analysis: Analyze query execution patterns to understand workload characteristics and optimization requirements
  4. Performance Testing: Implement comprehensive performance testing procedures for index changes and query optimizations
  5. Capacity Planning: Monitor query performance trends and resource utilization for proactive capacity management
  6. Automated Optimization: Establish automated performance monitoring and optimization recommendation systems
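
As a concrete illustration of the ESR rule, here is a minimal Node.js sketch against a hypothetical orders collection, where the query filters on status (equality), sorts by customerId, and ranges over orderDate. The collection, field names, and connection handling are assumptions for illustration, not part of a specific application.

// ESR (Equality, Sort, Range) compound index sketch for a hypothetical 'orders' collection
const { MongoClient } = require('mongodb');

async function createEsrIndex(uri) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const orders = client.db('shop').collection('orders');

    // Key order follows ESR: equality field first, then the sort field, then the range field
    await orders.createIndex(
      { status: 1, customerId: 1, orderDate: -1 },
      { name: 'status_customerId_orderDate_esr' }
    );

    // Confirm the planner selects the index and avoids an in-memory (blocking) sort
    const plan = await orders
      .find({ status: 'shipped', orderDate: { $gte: new Date('2024-01-01') } })
      .sort({ customerId: 1 })
      .explain('queryPlanner');
    console.log(JSON.stringify(plan.queryPlanner.winningPlan, null, 2));
  } finally {
    await client.close();
  }
}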

Enterprise Performance Management

Design performance optimization systems for enterprise-scale requirements:

  1. Real-Time Monitoring: Implement comprehensive real-time performance monitoring with intelligent alerting and automated responses
  2. Predictive Analytics: Use performance trend analysis and predictive modeling for proactive optimization and capacity planning
  3. Performance Governance: Establish performance standards, monitoring procedures, and optimization workflows
  4. Resource Optimization: Balance query performance with storage efficiency and maintenance overhead
  5. Compliance Integration: Ensure performance optimization procedures meet operational and compliance requirements
  6. Knowledge Management: Document optimization procedures, performance patterns, and best practices for operational excellence

Conclusion

MongoDB index optimization and query performance analysis provide comprehensive database tuning capabilities that enable applications to achieve optimal performance through intelligent indexing strategies, sophisticated query analysis, and automated optimization recommendations. The native performance analysis tools and integrated optimization guidance ensure that database operations maintain peak efficiency with minimal operational overhead.

Key MongoDB Performance Optimization benefits include:

  • Intelligent Analysis: Advanced query performance analysis with automated bottleneck identification and optimization recommendations
  • Index Optimization: Comprehensive index usage analysis with effectiveness assessment and automated cleanup suggestions
  • Real-Time Monitoring: Continuous performance monitoring with intelligent alerting and proactive optimization capabilities
  • Execution Plan Analysis: Detailed query execution plan analysis with optimization insights and improvement recommendations
  • Automated Recommendations: AI-powered optimization recommendations based on workload patterns and performance characteristics
  • SQL Accessibility: Familiar SQL-style performance operations through QueryLeaf for accessible database optimization

Whether you're optimizing high-traffic applications, managing large-scale data workloads, implementing performance monitoring systems, or maintaining enterprise database performance, MongoDB performance optimization with QueryLeaf's familiar SQL interface provides the foundation for sophisticated, scalable database tuning operations.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style performance analysis operations into MongoDB's native profiling and indexing capabilities, making advanced performance optimization accessible to SQL-oriented database administrators. Complex index analysis, query optimization recommendations, and performance monitoring are seamlessly handled through familiar SQL constructs, enabling sophisticated database tuning without requiring deep MongoDB performance expertise.

The combination of MongoDB's robust performance analysis capabilities with SQL-style optimization operations makes it an ideal platform for applications requiring both sophisticated database performance management and familiar database administration patterns, ensuring your database operations can maintain optimal performance while scaling efficiently as workload complexity and data volume grow.

MongoDB Atlas Deployment Automation and Cloud Infrastructure: Advanced DevOps Integration and Infrastructure-as-Code for Scalable Database Operations

Modern cloud-native applications require sophisticated database infrastructure that can automatically scale, self-heal, and integrate seamlessly with DevOps workflows and CI/CD pipelines. Traditional database deployment approaches rely on manual configuration and complex scaling procedures, and they carry extensive operational overhead just to keep database infrastructure production-ready. Effective cloud database management demands automated provisioning, intelligent resource optimization, and integrated monitoring capabilities.

MongoDB Atlas provides comprehensive cloud database automation through infrastructure-as-code integration, automated scaling policies, and advanced DevOps toolchain compatibility that enables sophisticated database operations with minimal manual intervention. Unlike traditional database hosting that requires complex server management and manual optimization, Atlas integrates database infrastructure directly into modern DevOps workflows with automated provisioning, intelligent scaling, and built-in operational excellence.

The Traditional Cloud Database Deployment Challenge

Conventional approaches to cloud database infrastructure management face significant operational complexity:

-- Traditional cloud database management - manual setup with extensive operational overhead

-- Basic database server provisioning tracking (manual process)
CREATE TABLE database_servers (
    server_id SERIAL PRIMARY KEY,
    server_name VARCHAR(255) NOT NULL,
    cloud_provider VARCHAR(100) NOT NULL,
    instance_type VARCHAR(100) NOT NULL,
    region VARCHAR(100) NOT NULL,

    -- Manual resource configuration
    cpu_cores INTEGER,
    memory_gb INTEGER,
    storage_gb INTEGER,
    iops INTEGER,

    -- Network configuration (manual setup)
    vpc_id VARCHAR(100),
    subnet_id VARCHAR(100),
    security_group_ids TEXT[],
    public_ip INET,
    private_ip INET,

    -- Database configuration
    database_engine VARCHAR(50) DEFAULT 'postgresql',
    engine_version VARCHAR(20),
    port INTEGER DEFAULT 5432,

    -- Status tracking
    server_status VARCHAR(50) DEFAULT 'creating',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    provisioned_by VARCHAR(100),

    -- Cost tracking (manual)
    estimated_monthly_cost DECIMAL(10,2),
    actual_monthly_cost DECIMAL(10,2)
);

-- Database deployment tracking (complex manual process)
CREATE TABLE database_deployments (
    deployment_id SERIAL PRIMARY KEY,
    deployment_name VARCHAR(255) NOT NULL,
    server_id INTEGER REFERENCES database_servers(server_id),
    environment VARCHAR(100) NOT NULL,

    -- Deployment configuration (manual setup)
    database_name VARCHAR(100) NOT NULL,
    schema_version VARCHAR(50),
    application_version VARCHAR(50),

    -- Manual backup configuration
    backup_enabled BOOLEAN DEFAULT true,
    backup_schedule VARCHAR(100), -- Cron format
    backup_retention_days INTEGER DEFAULT 30,
    backup_storage_location VARCHAR(200),

    -- Scaling configuration (manual)
    enable_auto_scaling BOOLEAN DEFAULT false,
    min_capacity INTEGER,
    max_capacity INTEGER,
    target_cpu_utilization DECIMAL(5,2) DEFAULT 70.0,
    target_memory_utilization DECIMAL(5,2) DEFAULT 80.0,

    -- Monitoring setup (manual integration)
    monitoring_enabled BOOLEAN DEFAULT false,
    monitoring_tools TEXT[],
    alert_endpoints TEXT[],

    -- Deployment metadata
    deployment_status VARCHAR(50) DEFAULT 'pending',
    deployed_at TIMESTAMP,
    deployed_by VARCHAR(100),
    deployment_duration_seconds INTEGER,

    -- Configuration validation
    config_validation_status VARCHAR(50),
    validation_errors TEXT[]
);

-- Manual scaling operation tracking
CREATE TABLE scaling_operations (
    scaling_id SERIAL PRIMARY KEY,
    server_id INTEGER REFERENCES database_servers(server_id),
    scaling_trigger VARCHAR(100),

    -- Resource changes (manual calculation)
    previous_cpu_cores INTEGER,
    new_cpu_cores INTEGER,
    previous_memory_gb INTEGER,
    new_memory_gb INTEGER,
    previous_storage_gb INTEGER,
    new_storage_gb INTEGER,

    -- Scaling metrics
    trigger_metric VARCHAR(100),
    trigger_threshold DECIMAL(10,2),
    current_utilization DECIMAL(10,2),

    -- Scaling execution
    scaling_status VARCHAR(50) DEFAULT 'pending',
    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP,
    downtime_seconds INTEGER,

    -- Cost impact
    previous_hourly_cost DECIMAL(10,4),
    new_hourly_cost DECIMAL(10,4),
    cost_impact_monthly DECIMAL(10,2)
);

-- Basic monitoring and alerting (very limited automation)
CREATE OR REPLACE FUNCTION check_database_health()
RETURNS TABLE (
    server_id INTEGER,
    health_status VARCHAR(50),
    cpu_utilization DECIMAL(5,2),
    memory_utilization DECIMAL(5,2),
    disk_utilization DECIMAL(5,2),
    connection_count INTEGER,
    active_queries INTEGER,
    replication_lag_seconds INTEGER,
    backup_status VARCHAR(50),
    alert_level VARCHAR(20),
    recommendations TEXT[]
) AS $$
BEGIN
    -- This would be a very simplified health check
    -- Real implementation would require complex monitoring integration

    RETURN QUERY
    SELECT 
        ds.server_id,

        -- Basic status assessment (very limited)
        CASE 
            WHEN ds.server_status != 'running' THEN 'unhealthy'
            ELSE 'healthy'
        END as health_status,

        -- Simulated metrics (would need real monitoring integration)
        (random() * 100)::DECIMAL(5,2) as cpu_utilization,
        (random() * 100)::DECIMAL(5,2) as memory_utilization,
        (random() * 100)::DECIMAL(5,2) as disk_utilization,
        (random() * 100)::INTEGER as connection_count,
        (random() * 20)::INTEGER as active_queries,
        (random() * 10)::INTEGER as replication_lag_seconds,

        -- Backup status (manual tracking)
        'unknown' as backup_status,

        -- Alert level determination
        CASE 
            WHEN ds.server_status != 'running' THEN 'critical'
            WHEN random() > 0.9 THEN 'warning'
            ELSE 'info'
        END as alert_level,

        -- Basic recommendations (very limited)
        ARRAY[
            CASE WHEN random() > 0.8 THEN 'Consider scaling up CPU resources' END,
            CASE WHEN random() > 0.7 THEN 'Review backup configuration' END,
            CASE WHEN random() > 0.6 THEN 'Monitor connection pool usage' END
        ]::TEXT[] as recommendations

    FROM database_servers ds
    WHERE ds.server_status = 'running';
END;
$$ LANGUAGE plpgsql;

-- Manual deployment automation attempt (very basic)
CREATE OR REPLACE FUNCTION deploy_database_environment(
    deployment_name_param VARCHAR(255),
    environment_param VARCHAR(100),
    instance_type_param VARCHAR(100),
    database_name_param VARCHAR(100)
) RETURNS TABLE (
    deployment_success BOOLEAN,
    deployment_id INTEGER,
    server_id INTEGER,
    deployment_time_seconds INTEGER,
    error_message TEXT
) AS $$
DECLARE
    new_deployment_id INTEGER;
    new_server_id INTEGER;
    deployment_start TIMESTAMP;
    deployment_end TIMESTAMP;
    deployment_error TEXT := '';
    deployment_result BOOLEAN := true;
BEGIN
    deployment_start := clock_timestamp();

    BEGIN
        -- Step 1: Create database server record (manual provisioning simulation)
        INSERT INTO database_servers (
            server_name,
            cloud_provider,
            instance_type,
            region,
            cpu_cores,
            memory_gb,
            storage_gb,
            server_status,
            provisioned_by
        )
        VALUES (
            deployment_name_param || '_' || environment_param,
            'manual_cloud_provider',
            instance_type_param,
            'us-east-1',
            -- Static resource allocation (no optimization)
            CASE instance_type_param
                WHEN 't3.micro' THEN 1
                WHEN 't3.small' THEN 2
                WHEN 't3.medium' THEN 2
                ELSE 4
            END,
            CASE instance_type_param
                WHEN 't3.micro' THEN 1
                WHEN 't3.small' THEN 2
                WHEN 't3.medium' THEN 4
                ELSE 8
            END,
            100, -- Fixed storage
            'creating',
            current_user
        )
        RETURNING server_id INTO new_server_id;

        -- Simulate provisioning time
        PERFORM pg_sleep(2);

        -- Update server status
        UPDATE database_servers 
        SET server_status = 'running', last_updated = clock_timestamp()
        WHERE server_id = new_server_id;

        -- Step 2: Create deployment record
        INSERT INTO database_deployments (
            deployment_name,
            server_id,
            environment,
            database_name,
            deployment_status,
            deployed_by
        )
        VALUES (
            deployment_name_param,
            new_server_id,
            environment_param,
            database_name_param,
            'creating',
            current_user
        )
        RETURNING deployment_id INTO new_deployment_id;

        -- Simulate deployment process
        PERFORM pg_sleep(1);

        -- Update deployment status
        UPDATE database_deployments 
        SET deployment_status = 'completed',
            deployed_at = clock_timestamp()
        WHERE deployment_id = new_deployment_id;

    EXCEPTION WHEN OTHERS THEN
        deployment_result := false;
        deployment_error := SQLERRM;

        -- Cleanup on failure
        IF new_server_id IS NOT NULL THEN
            UPDATE database_servers 
            SET server_status = 'failed'
            WHERE server_id = new_server_id;
        END IF;

        IF new_deployment_id IS NOT NULL THEN
            UPDATE database_deployments 
            SET deployment_status = 'failed'
            WHERE deployment_id = new_deployment_id;
        END IF;
    END;

    deployment_end := clock_timestamp();

    RETURN QUERY SELECT 
        deployment_result,
        new_deployment_id,
        new_server_id,
        EXTRACT(EPOCH FROM deployment_end - deployment_start)::INTEGER,
        deployment_error;
END;
$$ LANGUAGE plpgsql;

-- Basic infrastructure monitoring query (very limited capabilities)
WITH server_utilization AS (
    SELECT 
        ds.server_id,
        ds.server_name,
        ds.instance_type,
        ds.cpu_cores,
        ds.memory_gb,
        ds.storage_gb,
        ds.server_status,
        ds.estimated_monthly_cost,

        -- Simulated current utilization (would need real monitoring)
        (random() * 100)::DECIMAL(5,2) as current_cpu_percent,
        (random() * 100)::DECIMAL(5,2) as current_memory_percent,
        (random() * 100)::DECIMAL(5,2) as current_storage_percent,

        -- Basic scaling recommendations (very limited logic)
        CASE 
            WHEN random() > 0.8 THEN 'scale_up'
            WHEN random() < 0.2 THEN 'scale_down'
            ELSE 'no_action'
        END as scaling_recommendation

    FROM database_servers ds
    WHERE ds.server_status = 'running'
),

cost_analysis AS (
    SELECT 
        su.*,
        dd.environment,

        -- Basic cost optimization suggestions (manual analysis)
        CASE 
            WHEN su.current_cpu_percent < 30 AND su.current_memory_percent < 30 THEN 'overprovisioned'
            WHEN su.current_cpu_percent > 80 OR su.current_memory_percent > 80 THEN 'underprovisioned'
            ELSE 'appropriately_sized'
        END as resource_sizing,

        -- Simple cost projection
        su.estimated_monthly_cost * 
        CASE su.scaling_recommendation
            WHEN 'scale_up' THEN 1.5
            WHEN 'scale_down' THEN 0.7
            ELSE 1.0
        END as projected_monthly_cost

    FROM server_utilization su
    JOIN database_deployments dd ON su.server_id = dd.server_id
)

SELECT 
    ca.server_name,
    ca.environment,
    ca.instance_type,
    ca.server_status,

    -- Resource utilization
    ca.current_cpu_percent,
    ca.current_memory_percent,
    ca.current_storage_percent,

    -- Scaling analysis
    ca.scaling_recommendation,
    ca.resource_sizing,

    -- Cost analysis
    ca.estimated_monthly_cost,
    ca.projected_monthly_cost,
    ROUND((ca.projected_monthly_cost - ca.estimated_monthly_cost), 2) as monthly_cost_impact,

    -- Basic recommendations
    CASE 
        WHEN ca.resource_sizing = 'overprovisioned' THEN 'Consider downsizing to reduce costs'
        WHEN ca.resource_sizing = 'underprovisioned' THEN 'Scale up to improve performance'
        WHEN ca.current_storage_percent > 85 THEN 'Increase storage capacity soon'
        ELSE 'Monitor current resource usage'
    END as operational_recommendation

FROM cost_analysis ca
ORDER BY ca.estimated_monthly_cost DESC;

-- Problems with traditional cloud database deployment:
-- 1. Manual provisioning with no infrastructure-as-code integration
-- 2. Limited auto-scaling capabilities requiring manual intervention
-- 3. Basic monitoring with no intelligent alerting or remediation
-- 4. Complex backup and disaster recovery configuration
-- 5. No built-in security best practices or compliance features
-- 6. Manual cost optimization requiring constant monitoring
-- 7. Limited integration with CI/CD pipelines and DevOps workflows
-- 8. No automatic patching or maintenance scheduling
-- 9. Complex networking and security group management
-- 10. Basic performance optimization requiring database expertise

MongoDB Atlas provides comprehensive cloud database automation with advanced DevOps integration:

// MongoDB Atlas Advanced Deployment Automation and Cloud Infrastructure Management
const { MongoClient } = require('mongodb');
const axios = require('axios');

// Comprehensive MongoDB Atlas Infrastructure Manager
class AdvancedAtlasInfrastructureManager {
  constructor(atlasConfig = {}) {
    // Atlas API configuration
    this.atlasConfig = {
      publicKey: atlasConfig.publicKey,
      privateKey: atlasConfig.privateKey,
      baseURL: atlasConfig.baseURL || 'https://cloud.mongodb.com/api/atlas/v1.0',

      // Organization and project configuration
      organizationId: atlasConfig.organizationId,
      projectId: atlasConfig.projectId,

      // Infrastructure automation settings
      enableAutomatedDeployment: atlasConfig.enableAutomatedDeployment !== false,
      enableInfrastructureAsCode: atlasConfig.enableInfrastructureAsCode || false,
      enableAutomatedScaling: atlasConfig.enableAutomatedScaling !== false,

      // DevOps integration
      cicdIntegration: atlasConfig.cicdIntegration || false,
      terraformIntegration: atlasConfig.terraformIntegration || false,
      kubernetesIntegration: atlasConfig.kubernetesIntegration || false,

      // Monitoring and alerting
      enableAdvancedMonitoring: atlasConfig.enableAdvancedMonitoring !== false,
      enableAutomatedAlerting: atlasConfig.enableAutomatedAlerting !== false,
      enablePerformanceAdvisor: atlasConfig.enablePerformanceAdvisor !== false,

      // Security and compliance
      enableAdvancedSecurity: atlasConfig.enableAdvancedSecurity !== false,
      enableEncryptionAtRest: atlasConfig.enableEncryptionAtRest !== false,
      enableNetworkSecurity: atlasConfig.enableNetworkSecurity !== false,

      // Backup and disaster recovery
      enableContinuousBackup: atlasConfig.enableContinuousBackup !== false,
      enableCrossRegionBackup: atlasConfig.enableCrossRegionBackup || false,
      backupRetentionDays: atlasConfig.backupRetentionDays || 30,

      // Cost optimization
      enableCostOptimization: atlasConfig.enableCostOptimization || false,
      enableAutoArchiving: atlasConfig.enableAutoArchiving || false,
      costBudgetAlerts: atlasConfig.costBudgetAlerts || []
    };

    // Infrastructure state management
    this.clusters = new Map();
    this.deployments = new Map();
    this.scalingOperations = new Map();
    this.monitoringAlerts = new Map();

    // DevOps integration state
    this.cicdPipelines = new Map();
    this.infrastructureTemplates = new Map();

    // Performance and cost tracking
    this.performanceMetrics = {
      totalClusters: 0,
      averageResponseTime: 0,
      totalMonthlySpend: 0,
      costPerOperation: 0
    };

    this.initializeAtlasInfrastructure();
  }

  async initializeAtlasInfrastructure() {
    console.log('Initializing MongoDB Atlas infrastructure management...');

    try {
      // Validate Atlas API credentials
      await this.validateAtlasCredentials();

      // Initialize infrastructure automation
      if (this.atlasConfig.enableAutomatedDeployment) {
        await this.setupAutomatedDeployment();
      }

      // Setup infrastructure-as-code integration
      if (this.atlasConfig.enableInfrastructureAsCode) {
        await this.setupInfrastructureAsCode();
      }

      // Initialize monitoring and alerting
      if (this.atlasConfig.enableAdvancedMonitoring) {
        await this.setupAdvancedMonitoring();
      }

      // Setup DevOps integrations
      if (this.atlasConfig.cicdIntegration) {
        await this.setupCICDIntegration();
      }

      console.log('Atlas infrastructure management initialized successfully');

    } catch (error) {
      console.error('Error initializing Atlas infrastructure:', error);
      throw error;
    }
  }

  async deployCluster(clusterConfig, deploymentOptions = {}) {
    console.log(`Deploying Atlas cluster: ${clusterConfig.name}`);

    try {
      const deployment = {
        deploymentId: this.generateDeploymentId(),
        clusterName: clusterConfig.name,

        // Cluster specification
        clusterSpec: {
          name: clusterConfig.name,
          clusterType: clusterConfig.clusterType || 'REPLICASET',
          mongoDBVersion: clusterConfig.mongoDBVersion || '7.0',

          // Provider configuration
          providerSettings: {
            providerName: clusterConfig.providerName || 'AWS',
            regionName: clusterConfig.regionName || 'US_EAST_1',
            instanceSizeName: clusterConfig.instanceSizeName || 'M30',

            // Advanced configuration
            diskIOPS: clusterConfig.diskIOPS,
            encryptEBSVolume: this.atlasConfig.enableEncryptionAtRest,
            volumeType: clusterConfig.volumeType || 'STANDARD'
          },

          // Replication configuration
          replicationSpecs: clusterConfig.replicationSpecs || [
            {
              numShards: 1,
              regionsConfig: {
                [clusterConfig.regionName || 'US_EAST_1']: {
                  electableNodes: 3,
                  priority: 7,
                  readOnlyNodes: 0
                }
              }
            }
          ],

          // Backup configuration
          backupEnabled: this.atlasConfig.enableContinuousBackup,
          providerBackupEnabled: this.atlasConfig.enableCrossRegionBackup,

          // Auto-scaling configuration
          autoScaling: {
            diskGBEnabled: this.atlasConfig.enableAutomatedScaling,
            compute: {
              enabled: this.atlasConfig.enableAutomatedScaling,
              scaleDownEnabled: true,
              minInstanceSize: clusterConfig.minInstanceSize || 'M10',
              maxInstanceSize: clusterConfig.maxInstanceSize || 'M80'
            }
          }
        },

        // Deployment configuration
        deploymentConfig: {
          environment: deploymentOptions.environment || 'production',
          deploymentType: deploymentOptions.deploymentType || 'standard',
          rolloutStrategy: deploymentOptions.rolloutStrategy || 'immediate',

          // Network security
          networkAccessList: clusterConfig.networkAccessList || [],

          // Database users
          databaseUsers: clusterConfig.databaseUsers || [],

          // Advanced security
          ldapConfiguration: clusterConfig.ldapConfiguration,
          encryptionAtRestProvider: clusterConfig.encryptionAtRestProvider
        },

        // Deployment metadata
        startTime: new Date(),
        status: 'creating',
        createdBy: deploymentOptions.createdBy || 'system'
      };

      // Store deployment state
      this.deployments.set(deployment.deploymentId, deployment);

      // Execute Atlas cluster creation
      const clusterResponse = await this.createAtlasCluster(deployment.clusterSpec);

      // Setup monitoring and alerting
      if (this.atlasConfig.enableAdvancedMonitoring) {
        await this.setupClusterMonitoring(clusterResponse.clusterId, deployment);
      }

      // Configure network security
      await this.configureNetworkSecurity(clusterResponse.clusterId, deployment.deploymentConfig);

      // Create database users
      await this.createDatabaseUsers(clusterResponse.clusterId, deployment.deploymentConfig.databaseUsers);

      // Wait for cluster to be ready
      const clusterStatus = await this.waitForClusterReady(clusterResponse.clusterId);

      // Update deployment status
      deployment.status = 'completed';
      deployment.endTime = new Date();
      deployment.clusterId = clusterResponse.clusterId;
      deployment.connectionString = clusterStatus.connectionString;

      // Store cluster information
      this.clusters.set(clusterResponse.clusterId, {
        clusterId: clusterResponse.clusterId,
        deployment: deployment,
        specification: deployment.clusterSpec,
        status: clusterStatus,
        createdAt: deployment.startTime
      });

      // Update performance metrics
      this.updateInfrastructureMetrics(deployment);

      console.log(`Cluster deployed successfully: ${clusterConfig.name} (${clusterResponse.clusterId})`);

      return {
        success: true,
        deploymentId: deployment.deploymentId,
        clusterId: clusterResponse.clusterId,
        clusterName: clusterConfig.name,
        connectionString: clusterStatus.connectionString,

        // Deployment details
        deploymentTime: deployment.endTime.getTime() - deployment.startTime.getTime(),
        environment: deployment.deploymentConfig.environment,
        configuration: deployment.clusterSpec,

        // Monitoring and security
        monitoringEnabled: this.atlasConfig.enableAdvancedMonitoring,
        securityEnabled: this.atlasConfig.enableAdvancedSecurity,
        backupEnabled: deployment.clusterSpec.backupEnabled
      };

    } catch (error) {
      console.error(`Error deploying cluster '${clusterConfig.name}':`, error);

      // Mark the in-flight deployment record as failed (the deployment object created in the
      // try block is not in scope here, so look it up by cluster name instead)
      const failedDeployment = Array.from(this.deployments.values())
        .find(d => d.clusterName === clusterConfig.name && d.status === 'creating');
      if (failedDeployment) {
        failedDeployment.status = 'failed';
        failedDeployment.error = error.message;
        failedDeployment.endTime = new Date();
      }

      return {
        success: false,
        error: error.message,
        clusterName: clusterConfig.name
      };
    }
  }

  async createAtlasCluster(clusterSpec) {
    console.log(`Creating Atlas cluster via API: ${clusterSpec.name}`);

    try {
      const response = await this.atlasAPIRequest('POST', `/groups/${this.atlasConfig.projectId}/clusters`, clusterSpec);

      return {
        clusterId: response.id,
        name: response.name,
        stateName: response.stateName,
        createDate: response.createDate
      };

    } catch (error) {
      console.error('Error creating Atlas cluster:', error);
      throw error;
    }
  }

  async setupAutomatedScaling(clusterId, scalingConfig) {
    console.log(`Setting up automated scaling for cluster: ${clusterId}`);

    try {
      const scalingConfiguration = {
        clusterId: clusterId,

        // Compute scaling configuration
        computeScaling: {
          enabled: scalingConfig.computeScaling !== false,
          scaleDownEnabled: scalingConfig.scaleDownEnabled !== false,

          // Instance size limits
          minInstanceSize: scalingConfig.minInstanceSize || 'M10',
          maxInstanceSize: scalingConfig.maxInstanceSize || 'M80',

          // Scaling triggers
          targetCPUUtilization: scalingConfig.targetCPUUtilization || 75,
          targetMemoryUtilization: scalingConfig.targetMemoryUtilization || 80,

          // Scaling behavior
          scaleUpPolicy: {
            cooldownMinutes: scalingConfig.scaleUpCooldown || 15,
            incrementPercent: scalingConfig.scaleUpIncrement || 100,
            units: 'INSTANCE_SIZE'
          },
          scaleDownPolicy: {
            cooldownMinutes: scalingConfig.scaleDownCooldown || 30,
            decrementPercent: scalingConfig.scaleDownDecrement || 50,
            units: 'INSTANCE_SIZE'
          }
        },

        // Storage scaling configuration
        storageScaling: {
          enabled: scalingConfig.storageScaling !== false,

          // Storage scaling triggers
          targetStorageUtilization: scalingConfig.targetStorageUtilization || 85,
          incrementGigabytes: scalingConfig.storageIncrement || 10,
          maxStorageGigabytes: scalingConfig.maxStorage || 4096
        },

        // Advanced scaling features
        advancedScaling: {
          enablePredictiveScaling: scalingConfig.enablePredictiveScaling || false,
          enableScheduledScaling: scalingConfig.enableScheduledScaling || false,
          scheduledScalingEvents: scalingConfig.scheduledScalingEvents || []
        }
      };

      // Configure compute auto-scaling
      if (scalingConfiguration.computeScaling.enabled) {
        await this.configureComputeScaling(clusterId, scalingConfiguration.computeScaling);
      }

      // Configure storage auto-scaling
      if (scalingConfiguration.storageScaling.enabled) {
        await this.configureStorageScaling(clusterId, scalingConfiguration.storageScaling);
      }

      // Store scaling configuration
      this.scalingOperations.set(clusterId, scalingConfiguration);

      return {
        success: true,
        clusterId: clusterId,
        scalingConfiguration: scalingConfiguration
      };

    } catch (error) {
      console.error(`Error setting up automated scaling for cluster ${clusterId}:`, error);
      return {
        success: false,
        error: error.message,
        clusterId: clusterId
      };
    }
  }

  async setupAdvancedMonitoring(clusterId, monitoringConfig = {}) {
    console.log(`Setting up advanced monitoring for cluster: ${clusterId}`);

    try {
      const monitoringConfiguration = {
        clusterId: clusterId,

        // Performance monitoring
        performanceMonitoring: {
          enabled: monitoringConfig.performanceMonitoring !== false,

          // Metrics collection
          collectDetailedMetrics: true,
          metricsRetentionDays: monitoringConfig.metricsRetentionDays || 30,

          // Performance insights
          enableSlowQueryAnalysis: true,
          enableIndexSuggestions: true,
          enableQueryOptimization: true,

          // Real-time monitoring
          enableRealTimeAlerts: true,
          alertLatencyThresholds: {
            warning: monitoringConfig.warningLatency || 1000,
            critical: monitoringConfig.criticalLatency || 5000
          }
        },

        // Infrastructure monitoring
        infrastructureMonitoring: {
          enabled: monitoringConfig.infrastructureMonitoring !== false,

          // Resource monitoring
          monitorCPUUtilization: true,
          monitorMemoryUtilization: true,
          monitorStorageUtilization: true,
          monitorNetworkUtilization: true,

          // Capacity planning
          enableCapacityForecasting: true,
          forecastingHorizonDays: monitoringConfig.forecastingHorizon || 30,

          // Health checks
          enableHealthChecks: true,
          healthCheckIntervalMinutes: monitoringConfig.healthCheckInterval || 5
        },

        // Application monitoring
        applicationMonitoring: {
          enabled: monitoringConfig.applicationMonitoring !== false,

          // Connection monitoring
          monitorConnectionUsage: true,
          connectionPoolAnalysis: true,

          // Query monitoring
          slowQueryThresholdMs: monitoringConfig.slowQueryThreshold || 1000,
          enableQueryProfiling: true,
          profileSampleRate: monitoringConfig.profileSampleRate || 0.1,

          // Error monitoring
          enableErrorTracking: true,
          errorAlertThreshold: monitoringConfig.errorAlertThreshold || 10
        },

        // Security monitoring
        securityMonitoring: {
          enabled: this.atlasConfig.enableAdvancedSecurity,

          // Access monitoring
          monitorDatabaseAccess: true,
          unusualAccessAlerts: true,

          // Authentication monitoring
          authenticationFailureAlerts: true,
          multipleFailedAttemptsThreshold: 5,

          // Data access monitoring
          sensitiveDataAccessMonitoring: true,
          dataExportMonitoring: true
        }
      };

      // Setup performance monitoring
      if (monitoringConfiguration.performanceMonitoring.enabled) {
        await this.configurePerformanceMonitoring(clusterId, monitoringConfiguration.performanceMonitoring);
      }

      // Setup infrastructure monitoring
      if (monitoringConfiguration.infrastructureMonitoring.enabled) {
        await this.configureInfrastructureMonitoring(clusterId, monitoringConfiguration.infrastructureMonitoring);
      }

      // Setup application monitoring
      if (monitoringConfiguration.applicationMonitoring.enabled) {
        await this.configureApplicationMonitoring(clusterId, monitoringConfiguration.applicationMonitoring);
      }

      // Setup security monitoring
      if (monitoringConfiguration.securityMonitoring.enabled) {
        await this.configureSecurityMonitoring(clusterId, monitoringConfiguration.securityMonitoring);
      }

      // Store monitoring configuration
      this.monitoringAlerts.set(clusterId, monitoringConfiguration);

      return {
        success: true,
        clusterId: clusterId,
        monitoringConfiguration: monitoringConfiguration
      };

    } catch (error) {
      console.error(`Error setting up monitoring for cluster ${clusterId}:`, error);
      return {
        success: false,
        error: error.message,
        clusterId: clusterId
      };
    }
  }

  async setupInfrastructureAsCode(templateConfig = {}) {
    console.log('Setting up infrastructure-as-code integration...');

    try {
      const infrastructureTemplate = {
        templateId: this.generateTemplateId(),
        templateName: templateConfig.name || 'mongodb-atlas-infrastructure',
        templateType: templateConfig.type || 'terraform',

        // Template configuration
        templateConfiguration: {
          // Provider configuration
          provider: templateConfig.provider || 'terraform',
          version: templateConfig.version || '1.0',

          // Atlas provider settings
          atlasProvider: {
            publicKey: '${var.atlas_public_key}',
            privateKey: '${var.atlas_private_key}',
            baseURL: this.atlasConfig.baseURL
          },

          // Infrastructure resources
          resources: {
            // Project resource
            project: {
              name: '${var.project_name}',
              orgId: this.atlasConfig.organizationId,

              // Project configuration
              isCollectingBugs: false,
              isDataExplorerEnabled: true,
              isPerformanceAdvisorEnabled: true,
              isRealtimePerformancePanelEnabled: true,
              isSchemaAdvisorEnabled: true
            },

            // Cluster resources
            clusters: templateConfig.clusters || [],

            // Database user resources
            databaseUsers: templateConfig.databaseUsers || [],

            // Network access rules
            networkAccessList: templateConfig.networkAccessList || [],

            // Alert configurations
            alertConfigurations: templateConfig.alertConfigurations || []
          },

          // Variables
          variables: {
            atlas_public_key: {
              description: 'MongoDB Atlas API Public Key',
              type: 'string',
              sensitive: false
            },
            atlas_private_key: {
              description: 'MongoDB Atlas API Private Key',
              type: 'string',
              sensitive: true
            },
            project_name: {
              description: 'Atlas Project Name',
              type: 'string',
              default: 'default-project'
            },
            environment: {
              description: 'Deployment Environment',
              type: 'string',
              default: 'development'
            }
          },

          // Outputs
          outputs: {
            cluster_connection_strings: {
              description: 'Atlas Cluster Connection Strings',
              value: '${tomap({ for k, cluster in mongodbatlas_cluster.clusters : k => cluster.connection_strings[0].standard_srv })}'
            },
            cluster_ids: {
              description: 'Atlas Cluster IDs',
              value: '${tomap({ for k, cluster in mongodbatlas_cluster.clusters : k => cluster.cluster_id })}'
            },
            project_id: {
              description: 'Atlas Project ID',
              value: '${mongodbatlas_project.project.id}'
            }
          }
        },

        // CI/CD integration
        cicdIntegration: {
          enabled: templateConfig.cicdIntegration || false,

          // Pipeline configuration
          pipeline: {
            stages: ['validate', 'plan', 'apply'],
            approvalRequired: templateConfig.requireApproval !== false,

            // Environment promotion
            environments: ['development', 'staging', 'production'],
            promotionStrategy: templateConfig.promotionStrategy || 'manual'
          },

          // Integration settings
          integrations: {
            github: templateConfig.githubIntegration || false,
            jenkins: templateConfig.jenkinsIntegration || false,
            gitlab: templateConfig.gitlabIntegration || false,
            azureDevOps: templateConfig.azureDevOpsIntegration || false
          }
        }
      };

      // Generate Terraform configuration
      const terraformConfig = this.generateTerraformConfig(infrastructureTemplate);

      // Generate CI/CD pipeline configuration
      const pipelineConfig = this.generatePipelineConfig(infrastructureTemplate);

      // Store template configuration
      this.infrastructureTemplates.set(infrastructureTemplate.templateId, infrastructureTemplate);

      return {
        success: true,
        templateId: infrastructureTemplate.templateId,
        templateConfiguration: infrastructureTemplate,
        terraformConfig: terraformConfig,
        pipelineConfig: pipelineConfig
      };

    } catch (error) {
      console.error('Error setting up infrastructure-as-code:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async performCostOptimization(clusterId, optimizationOptions = {}) {
    console.log(`Performing cost optimization for cluster: ${clusterId}`);

    try {
      const cluster = this.clusters.get(clusterId);
      if (!cluster) {
        throw new Error(`Cluster not found: ${clusterId}`);
      }

      // Collect performance and utilization metrics
      const performanceMetrics = await this.collectPerformanceMetrics(clusterId);
      const utilizationMetrics = await this.collectUtilizationMetrics(clusterId);
      const costMetrics = await this.collectCostMetrics(clusterId);

      // Analyze optimization opportunities
      const optimizationAnalysis = {
        clusterId: clusterId,
        analysisTime: new Date(),

        // Performance analysis
        performanceAnalysis: {
          averageResponseTime: performanceMetrics.averageResponseTime,
          peakResponseTime: performanceMetrics.peakResponseTime,
          queryThroughput: performanceMetrics.queryThroughput,
          resourceBottlenecks: performanceMetrics.bottlenecks
        },

        // Utilization analysis
        utilizationAnalysis: {
          cpuUtilization: {
            average: utilizationMetrics.cpu.average,
            peak: utilizationMetrics.cpu.peak,
            recommendation: this.generateCPURecommendation(utilizationMetrics.cpu)
          },
          memoryUtilization: {
            average: utilizationMetrics.memory.average,
            peak: utilizationMetrics.memory.peak,
            recommendation: this.generateMemoryRecommendation(utilizationMetrics.memory)
          },
          storageUtilization: {
            used: utilizationMetrics.storage.used,
            available: utilizationMetrics.storage.available,
            growthRate: utilizationMetrics.storage.growthRate,
            recommendation: this.generateStorageRecommendation(utilizationMetrics.storage)
          }
        },

        // Cost analysis
        costAnalysis: {
          currentMonthlyCost: costMetrics.currentMonthlyCost,
          costTrends: costMetrics.trends,
          costBreakdown: costMetrics.breakdown,

          // Optimization opportunities
          optimizationOpportunities: []
        }
      };

      // Generate optimization recommendations
      const recommendations = this.generateOptimizationRecommendations(
        performanceMetrics,
        utilizationMetrics,
        costMetrics,
        optimizationOptions
      );

      optimizationAnalysis.recommendations = recommendations;

      // Calculate potential savings
      const savingsAnalysis = this.calculatePotentialSavings(recommendations, costMetrics);
      optimizationAnalysis.savingsAnalysis = savingsAnalysis;

      // Apply optimizations if auto-optimization is enabled
      if (optimizationOptions.autoOptimize) {
        const optimizationResults = await this.applyOptimizations(clusterId, recommendations);
        optimizationAnalysis.optimizationResults = optimizationResults;
      }

      return {
        success: true,
        clusterId: clusterId,
        optimizationAnalysis: optimizationAnalysis
      };

    } catch (error) {
      console.error(`Error performing cost optimization for cluster ${clusterId}:`, error);
      return {
        success: false,
        error: error.message,
        clusterId: clusterId
      };
    }
  }

  // Utility methods for Atlas operations

  async atlasAPIRequest(method, endpoint, data = null) {
    const url = `${this.atlasConfig.baseURL}${endpoint}`;
    // NOTE: the Atlas Admin API authenticates with HTTP Digest using the programmatic API key
    // pair; the Basic Authorization header below is a simplified placeholder, and a production
    // client would use a digest-capable HTTP library or wrapper instead.
    const auth = Buffer.from(`${this.atlasConfig.publicKey}:${this.atlasConfig.privateKey}`).toString('base64');

    try {
      const config = {
        method: method,
        url: url,
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Basic ${auth}`
        }
      };

      if (data) {
        config.data = data;
      }

      const response = await axios(config);
      return response.data;

    } catch (error) {
      console.error(`Atlas API request failed: ${method} ${endpoint}`, error);
      throw error;
    }
  }

  async validateAtlasCredentials() {
    try {
      await this.atlasAPIRequest('GET', '/orgs');
      console.log('Atlas API credentials validated successfully');
    } catch (error) {
      throw new Error('Invalid Atlas API credentials');
    }
  }

  generateDeploymentId() {
    return `deployment_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
  }

  generateTemplateId() {
    return `template_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
  }

  async waitForClusterReady(clusterId, timeoutMinutes = 30) {
    const timeout = timeoutMinutes * 60 * 1000;
    const startTime = Date.now();

    while (Date.now() - startTime < timeout) {
      try {
        // Atlas addresses clusters by name in this endpoint; clusterId is used here as a
        // simplified stand-in and would be the cluster name against the real Admin API
        const clusterStatus = await this.atlasAPIRequest('GET', `/groups/${this.atlasConfig.projectId}/clusters/${clusterId}`);

        if (clusterStatus.stateName === 'IDLE') {
          return {
            clusterId: clusterId,
            state: clusterStatus.stateName,
            connectionString: clusterStatus.connectionStrings?.standardSrv,
            mongoDBVersion: clusterStatus.mongoDBVersion
          };
        }

        console.log(`Waiting for cluster ${clusterId} to be ready. Current state: ${clusterStatus.stateName}`);
        await this.sleep(30000); // Wait 30 seconds

      } catch (error) {
        console.error(`Error checking cluster status: ${clusterId}`, error);
        await this.sleep(30000);
      }
    }

    throw new Error(`Cluster ${clusterId} did not become ready within ${timeoutMinutes} minutes`);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  updateInfrastructureMetrics(deployment) {
    this.performanceMetrics.totalClusters++;
    // Update other metrics based on deployment
  }

  generateTerraformConfig(infrastructureTemplate) {
    // Generate Terraform configuration files based on template
    return {
      mainTf: `# MongoDB Atlas Infrastructure Configuration
provider "mongodbatlas" {
  public_key  = var.atlas_public_key
  private_key = var.atlas_private_key
}

# Variables and resources would be generated here based on template
`,
      variablesTf: `# Infrastructure variables
variable "atlas_public_key" {
  description = "MongoDB Atlas API Public Key"
  type        = string
}

variable "atlas_private_key" {
  description = "MongoDB Atlas API Private Key"
  type        = string
  sensitive   = true
}
`,
      outputsTf: `# Infrastructure outputs
output "cluster_connection_strings" {
  description = "Atlas Cluster Connection Strings"
  value       = mongodbatlas_cluster.main.connection_strings
}
`
    };
  }

  generatePipelineConfig(infrastructureTemplate) {
    // Generate CI/CD pipeline configuration
    return {
      githubActions: `# GitHub Actions workflow for Atlas infrastructure
name: MongoDB Atlas Infrastructure Deployment

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  terraform:
    name: Terraform
    runs-on: ubuntu-latest

    steps:
    - name: Checkout
      uses: actions/checkout@v2

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v1

    - name: Terraform Plan
      run: terraform plan

    - name: Terraform Apply
      if: github.ref == 'refs/heads/main'
      run: terraform apply -auto-approve
`,
      jenkins: `// Jenkins pipeline for Atlas infrastructure
pipeline {
    agent any

    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }

        stage('Terraform Plan') {
            steps {
                sh 'terraform plan -out=tfplan'
            }
        }

        stage('Terraform Apply') {
            when {
                branch 'main'
            }
            steps {
                sh 'terraform apply tfplan'
            }
        }
    }
}
`
    };
  }

  // Additional methods would include implementations for:
  // - setupAutomatedDeployment()
  // - setupCICDIntegration()
  // - configureNetworkSecurity()
  // - createDatabaseUsers()
  // - configureComputeScaling()
  // - configureStorageScaling()
  // - configurePerformanceMonitoring()
  // - configureInfrastructureMonitoring()
  // - configureApplicationMonitoring()
  // - configureSecurityMonitoring()
  // - collectPerformanceMetrics()
  // - collectUtilizationMetrics()
  // - collectCostMetrics()
  // - generateOptimizationRecommendations()
  // - calculatePotentialSavings()
  // - applyOptimizations()
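
  // Hedged sketches for two of the helpers listed above, using the Atlas Admin API v1.0
  // project IP access list and database user endpoints. The shape of the entries read from
  // deploymentConfig is an assumption carried over from deployCluster(), and the payloads
  // should be treated as a sketch rather than a complete implementation.

  async configureNetworkSecurity(clusterId, deploymentConfig) {
    // The access list is project-scoped, so clusterId is only used for logging and comments here
    const entries = (deploymentConfig.networkAccessList || []).map(entry => ({
      ipAddress: entry.ipAddress,
      cidrBlock: entry.cidrBlock,
      comment: entry.comment || `Provisioned for cluster ${clusterId}`
    }));

    if (entries.length === 0) {
      console.log(`No network access entries supplied for cluster ${clusterId}`);
      return { success: true, entriesCreated: 0 };
    }

    await this.atlasAPIRequest('POST', `/groups/${this.atlasConfig.projectId}/accessList`, entries);
    return { success: true, entriesCreated: entries.length };
  }

  async createDatabaseUsers(clusterId, databaseUsers = []) {
    const results = [];
    for (const user of databaseUsers) {
      // Atlas database users are defined against the admin database by default
      const payload = {
        databaseName: user.databaseName || 'admin',
        username: user.username,
        password: user.password,
        roles: user.roles || [{ roleName: 'readWrite', databaseName: user.defaultDatabase || 'app' }]
      };
      const created = await this.atlasAPIRequest('POST', `/groups/${this.atlasConfig.projectId}/databaseUsers`, payload);
      results.push({ username: created.username, clusterId });
    }
    return results;
  }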
}

// Benefits of MongoDB Atlas Advanced Infrastructure Management:
// - Automated deployment with infrastructure-as-code integration
// - Intelligent auto-scaling based on real-time metrics and predictions
// - Comprehensive monitoring and alerting for proactive management
// - Advanced security and compliance features built-in
// - DevOps pipeline integration for continuous deployment
// - Cost optimization with automated resource right-sizing
// - Enterprise-grade backup and disaster recovery capabilities
// - Multi-cloud deployment and management capabilities
// - SQL-compatible operations through QueryLeaf integration
// - Production-ready infrastructure automation and orchestration

module.exports = {
  AdvancedAtlasInfrastructureManager
};
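
The Terraform files returned by generateTerraformConfig() can be written straight into a Terraform working directory before running terraform init and terraform plan. The following is a minimal, self-contained sketch; the ./terraform/atlas output path is a placeholder and the helper is not part of the manager class above.

// Write the { mainTf, variablesTf, outputsTf } object returned by
// generateTerraformConfig() into a Terraform working directory
const fs = require('fs').promises;
const path = require('path');

async function writeTerraformWorkspace(terraformConfig, workingDir = './terraform/atlas') {
  await fs.mkdir(workingDir, { recursive: true });

  const files = {
    'main.tf': terraformConfig.mainTf,
    'variables.tf': terraformConfig.variablesTf,
    'outputs.tf': terraformConfig.outputsTf
  };

  for (const [fileName, contents] of Object.entries(files)) {
    await fs.writeFile(path.join(workingDir, fileName), contents, 'utf8');
  }

  return workingDir; // run `terraform init && terraform plan` from this directory
}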

Understanding MongoDB Atlas Infrastructure Architecture

Advanced Cloud Database Operations and DevOps Integration Patterns

Implement sophisticated Atlas infrastructure patterns for enterprise deployments:

// Enterprise-grade Atlas infrastructure with advanced DevOps integration and multi-cloud capabilities
class EnterpriseAtlasOrchestrator extends AdvancedAtlasInfrastructureManager {
  constructor(atlasConfig, enterpriseConfig) {
    super(atlasConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableMultiCloudDeployment: true,
      enableDisasterRecoveryAutomation: true,
      enableComplianceAutomation: true,
      enableAdvancedSecurity: true,
      enableGlobalDistribution: true
    };

    this.setupEnterpriseCapabilities();
    this.initializeMultiCloudOrchestration();
    this.setupComplianceAutomation();
  }

  async implementMultiCloudStrategy(cloudConfiguration) {
    console.log('Implementing multi-cloud Atlas deployment strategy...');

    const multiCloudStrategy = {
      // Multi-cloud provider configuration
      cloudProviders: {
        aws: { regions: ['us-east-1', 'eu-west-1'], priority: 1 },
        gcp: { regions: ['us-central1', 'europe-west1'], priority: 2 },
        azure: { regions: ['eastus', 'westeurope'], priority: 3 }
      },

      // Global distribution strategy
      globalDistribution: {
        primaryRegion: 'us-east-1',
        secondaryRegions: ['eu-west-1', 'ap-southeast-1'],
        dataResidencyRules: true,
        latencyOptimization: true
      },

      // Disaster recovery automation
      disasterRecovery: {
        crossCloudBackup: true,
        automaticFailover: true,
        recoveryTimeObjective: '4h',
        recoveryPointObjective: '15min'
      }
    };

    return await this.deployMultiCloudInfrastructure(multiCloudStrategy);
  }

  async setupAdvancedComplianceAutomation() {
    console.log('Setting up enterprise compliance automation...');

    const complianceCapabilities = {
      // Regulatory compliance
      regulatoryFrameworks: {
        gdpr: { dataResidency: true, rightToErasure: true },
        hipaa: { encryption: true, auditLogging: true },
        sox: { changeTracking: true, accessControls: true },
        pci: { dataEncryption: true, networkSecurity: true }
      },

      // Automated compliance monitoring
      complianceMonitoring: {
        continuousAssessment: true,
        violationDetection: true,
        automaticRemediation: true,
        complianceReporting: true
      },

      // Enterprise security
      enterpriseSecurity: {
        zeroTrustNetworking: true,
        advancedThreatDetection: true,
        dataLossPrevention: true,
        privilegedAccessManagement: true
      }
    };

    return await this.deployComplianceAutomation(complianceCapabilities);
  }
}
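
To make the multi-cloud strategy above concrete, the short sketch below shows one way the cloudProviders priority map could drive failover region selection when the primary region becomes unavailable. The lowest-priority-number-wins rule and the unavailableRegions input are assumptions for illustration, not Atlas behavior.

// Pick a failover region from a provider map shaped like the cloudProviders
// object above: { aws: { regions: [...], priority: 1 }, ... }
function selectFailoverRegion(cloudProviders, primaryRegion, unavailableRegions = []) {
  const blocked = new Set([primaryRegion, ...unavailableRegions]);

  return Object.entries(cloudProviders)
    // lower priority number = more preferred provider
    .sort(([, a], [, b]) => a.priority - b.priority)
    .flatMap(([provider, config]) => config.regions.map(region => ({ provider, region })))
    .find(candidate => !blocked.has(candidate.region)) || null;
}

// Example with the provider map from implementMultiCloudStrategy():
// selectFailoverRegion(cloudProviders, 'us-east-1', ['eu-west-1'])
//   -> { provider: 'gcp', region: 'us-central1' }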

SQL-Style Atlas Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB Atlas infrastructure operations:

-- QueryLeaf advanced Atlas infrastructure operations with SQL-familiar syntax for MongoDB

-- Atlas cluster deployment with comprehensive configuration
CREATE ATLAS_CLUSTER production_cluster (
  -- Cluster configuration
  cluster_type = 'REPLICASET',
  mongodb_version = '7.0',

  -- Provider and region configuration
  provider_name = 'AWS',
  region_name = 'US_EAST_1',
  instance_size = 'M30',

  -- Multi-region configuration
  replication_specs = JSON_OBJECT(
    'num_shards', 1,
    'regions_config', JSON_OBJECT(
      'US_EAST_1', JSON_OBJECT(
        'electable_nodes', 3,
        'priority', 7,
        'read_only_nodes', 0
      ),
      'EU_WEST_1', JSON_OBJECT(
        'electable_nodes', 2,
        'priority', 6,
        'read_only_nodes', 1
      )
    )
  ),

  -- Auto-scaling configuration
  auto_scaling = JSON_OBJECT(
    'disk_gb_enabled', true,
    'compute_enabled', true,
    'compute_scale_down_enabled', true,
    'compute_min_instance_size', 'M10',
    'compute_max_instance_size', 'M80'
  ),

  -- Backup configuration
  backup_enabled = true,
  provider_backup_enabled = true,

  -- Performance configuration
  disk_iops = 3000,
  volume_type = 'PROVISIONED',
  encrypt_ebs_volume = true,

  -- Advanced configuration
  bi_connector_enabled = false,
  pit_enabled = true,
  oplog_size_mb = 2048,

  -- Network security
  network_access_list = ARRAY[
    JSON_OBJECT('ip_address', '10.0.0.0/8', 'comment', 'Internal network'),
    JSON_OBJECT('cidr_block', '172.16.0.0/12', 'comment', 'VPC network')
  ],

  -- Monitoring and alerting
  monitoring = JSON_OBJECT(
    'enable_performance_advisor', true,
    'enable_realtime_performance_panel', true,
    'enable_schema_advisor', true,
    'data_explorer_enabled', true
  )
);

-- Advanced Atlas cluster monitoring and performance analysis
WITH cluster_performance AS (
  SELECT 
    cluster_name,
    cluster_id,
    DATE_TRUNC('hour', metric_timestamp) as time_bucket,

    -- Performance metrics aggregation
    AVG(connections_current) as avg_connections,
    MAX(connections_current) as peak_connections,
    AVG(opcounters_query) as avg_queries_per_second,
    AVG(opcounters_insert) as avg_inserts_per_second,
    AVG(opcounters_update) as avg_updates_per_second,
    AVG(opcounters_delete) as avg_deletes_per_second,

    -- Resource utilization
    AVG(system_cpu_user) as avg_cpu_user,
    AVG(system_cpu_kernel) as avg_cpu_kernel,
    AVG(system_memory_used_mb) / AVG(system_memory_available_mb) * 100 as avg_memory_utilization,
    AVG(system_network_in_bytes) as avg_network_in_bytes,
    AVG(system_network_out_bytes) as avg_network_out_bytes,

    -- Storage metrics
    AVG(system_disk_space_used_data_bytes) as avg_data_size_bytes,
    AVG(system_disk_space_used_index_bytes) as avg_index_size_bytes,
    AVG(system_disk_space_used_total_bytes) as avg_total_storage_bytes,

    -- Performance indicators
    AVG(global_lock_current_queue_readers) as avg_queue_readers,
    AVG(global_lock_current_queue_writers) as avg_queue_writers,
    AVG(wt_cache_pages_currently_held_in_cache) as avg_cache_pages,

    -- Replication metrics
    AVG(replset_oplog_head_timestamp) as oplog_head_timestamp,
    AVG(replset_oplog_tail_timestamp) as oplog_tail_timestamp,
    MAX(replset_member_lag_millis) as max_replication_lag

  FROM ATLAS_METRICS('production_cluster')
  WHERE metric_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY cluster_name, cluster_id, DATE_TRUNC('hour', metric_timestamp)
),

performance_analysis AS (
  SELECT 
    cp.*,

    -- Calculate total CPU utilization
    (cp.avg_cpu_user + cp.avg_cpu_kernel) as total_cpu_utilization,

    -- Calculate storage utilization percentage
    CASE 
      WHEN cp.avg_total_storage_bytes > 0 THEN
        (cp.avg_data_size_bytes + cp.avg_index_size_bytes) / cp.avg_total_storage_bytes * 100
      ELSE 0
    END as storage_utilization_percent,

    -- Network utilization
    (cp.avg_network_in_bytes + cp.avg_network_out_bytes) / 1024 / 1024 as total_network_mb,

    -- Performance score calculation
    CASE 
      WHEN (cp.avg_cpu_user + cp.avg_cpu_kernel) > 80 THEN 'high_cpu_load'
      WHEN cp.avg_memory_utilization > 85 THEN 'high_memory_usage'
      WHEN cp.avg_queue_readers + cp.avg_queue_writers > 10 THEN 'high_queue_pressure'
      WHEN cp.max_replication_lag > 10000 THEN 'high_replication_lag'
      ELSE 'healthy'
    END as performance_status,

    -- Capacity planning indicators
    LAG(cp.avg_connections) OVER (ORDER BY cp.time_bucket) as prev_hour_connections,
    LAG(cp.avg_total_storage_bytes) OVER (ORDER BY cp.time_bucket) as prev_hour_storage,

    -- Query performance trends
    (cp.avg_queries_per_second + cp.avg_inserts_per_second + 
     cp.avg_updates_per_second + cp.avg_deletes_per_second) as total_operations_per_second

  FROM cluster_performance cp
),

scaling_recommendations AS (
  SELECT 
    pa.*,

    -- Connection scaling analysis
    CASE 
      WHEN pa.peak_connections > 500 AND pa.avg_connections / pa.peak_connections > 0.8 THEN 'scale_up_connections'
      WHEN pa.peak_connections < 100 AND pa.avg_connections < 50 THEN 'optimize_connection_pooling'
      ELSE 'connections_appropriate'
    END as connection_scaling_recommendation,

    -- Compute scaling analysis
    CASE 
      WHEN pa.total_cpu_utilization > 75 AND pa.avg_memory_utilization > 80 THEN 'scale_up_compute'
      WHEN pa.total_cpu_utilization < 30 AND pa.avg_memory_utilization < 50 THEN 'scale_down_compute'
      ELSE 'compute_appropriate'
    END as compute_scaling_recommendation,

    -- Storage scaling analysis
    CASE 
      WHEN pa.storage_utilization_percent > 85 THEN 'increase_storage_immediately'
      WHEN pa.storage_utilization_percent > 75 THEN 'monitor_storage_closely'
      WHEN (pa.avg_total_storage_bytes - pa.prev_hour_storage) > 1024*1024*1024 THEN 'high_storage_growth'
      ELSE 'storage_appropriate'
    END as storage_scaling_recommendation,

    -- Performance optimization recommendations
    ARRAY[
      CASE WHEN pa.avg_queue_readers > 5 THEN 'optimize_read_queries' END,
      CASE WHEN pa.avg_queue_writers > 5 THEN 'optimize_write_operations' END,
      CASE WHEN pa.max_replication_lag > 5000 THEN 'investigate_replication_lag' END,
      CASE WHEN pa.avg_cache_pages < 1000 THEN 'increase_cache_size' END
    ]::TEXT[] as performance_optimization_recommendations,

    -- Cost optimization opportunities
    CASE 
      WHEN pa.total_cpu_utilization < 25 AND pa.avg_memory_utilization < 40 THEN 'overprovisioned'
      WHEN pa.total_operations_per_second < 100 AND pa.avg_connections < 10 THEN 'underutilized'
      ELSE 'appropriately_sized'
    END as cost_optimization_status

  FROM performance_analysis pa
)

SELECT 
  sr.cluster_name,
  sr.time_bucket,

  -- Performance metrics
  ROUND(sr.total_cpu_utilization, 2) as cpu_utilization_percent,
  ROUND(sr.avg_memory_utilization, 2) as memory_utilization_percent,
  ROUND(sr.storage_utilization_percent, 2) as storage_utilization_percent,
  sr.avg_connections,
  sr.peak_connections,

  -- Operations throughput
  ROUND(sr.total_operations_per_second, 2) as operations_per_second,
  ROUND(sr.total_network_mb, 2) as network_throughput_mb,

  -- Performance assessment
  sr.performance_status,

  -- Scaling recommendations
  sr.connection_scaling_recommendation,
  sr.compute_scaling_recommendation,
  sr.storage_scaling_recommendation,

  -- Optimization recommendations
  ARRAY_REMOVE(sr.performance_optimization_recommendations, NULL) as optimization_recommendations,

  -- Cost optimization
  sr.cost_optimization_status,

  -- Growth trends
  CASE 
    WHEN sr.avg_connections > sr.prev_hour_connections * 1.1 THEN 'connection_growth'
    WHEN sr.avg_total_storage_bytes > sr.prev_hour_storage * 1.05 THEN 'storage_growth'
    ELSE 'stable'
  END as growth_trend,

  -- Alert conditions
  ARRAY[
    CASE WHEN sr.total_cpu_utilization > 90 THEN 'CRITICAL: CPU utilization very high' END,
    CASE WHEN sr.avg_memory_utilization > 95 THEN 'CRITICAL: Memory utilization critical' END,
    CASE WHEN sr.storage_utilization_percent > 90 THEN 'CRITICAL: Storage nearly full' END,
    CASE WHEN sr.max_replication_lag > 30000 THEN 'WARNING: High replication lag detected' END,
    CASE WHEN sr.avg_queue_readers + sr.avg_queue_writers > 20 THEN 'WARNING: High queue pressure' END
  ]::TEXT[] as active_alerts,

  -- Actionable insights
  CASE 
    WHEN sr.performance_status = 'high_cpu_load' THEN 'Scale up instance size or optimize queries'
    WHEN sr.performance_status = 'high_memory_usage' THEN 'Increase memory or optimize data structures'
    WHEN sr.performance_status = 'high_queue_pressure' THEN 'Optimize slow queries and add indexes'
    WHEN sr.performance_status = 'high_replication_lag' THEN 'Check network connectivity and oplog size'
    WHEN sr.cost_optimization_status = 'overprovisioned' THEN 'Consider scaling down to reduce costs'
    ELSE 'Continue monitoring current performance'
  END as recommended_action

FROM scaling_recommendations sr
WHERE sr.performance_status != 'healthy' 
   OR sr.cost_optimization_status IN ('overprovisioned', 'underutilized')
   OR sr.compute_scaling_recommendation != 'compute_appropriate'
ORDER BY 
  CASE sr.performance_status 
    WHEN 'high_cpu_load' THEN 1
    WHEN 'high_memory_usage' THEN 2
    WHEN 'high_queue_pressure' THEN 3
    WHEN 'high_replication_lag' THEN 4
    ELSE 5
  END,
  sr.time_bucket DESC;

-- Atlas infrastructure-as-code deployment and management
WITH deployment_templates AS (
  SELECT 
    template_name,
    template_version,
    environment,

    -- Infrastructure specification
    JSON_BUILD_OBJECT(
      'cluster_config', JSON_BUILD_OBJECT(
        'cluster_type', 'REPLICASET',
        'mongodb_version', '7.0',
        'provider_name', 'AWS',
        'instance_size', CASE environment
          WHEN 'production' THEN 'M30'
          WHEN 'staging' THEN 'M20'
          WHEN 'development' THEN 'M10'
        END,
        'replication_factor', CASE environment
          WHEN 'production' THEN 3
          WHEN 'staging' THEN 3
          WHEN 'development' THEN 1
        END
      ),
      'auto_scaling', JSON_BUILD_OBJECT(
        'compute_enabled', environment IN ('production', 'staging'),
        'storage_enabled', true,
        'min_instance_size', CASE environment
          WHEN 'production' THEN 'M30'
          WHEN 'staging' THEN 'M20'
          WHEN 'development' THEN 'M10'
        END,
        'max_instance_size', CASE environment
          WHEN 'production' THEN 'M80'
          WHEN 'staging' THEN 'M40'
          WHEN 'development' THEN 'M20'
        END
      ),
      'backup_config', JSON_BUILD_OBJECT(
        'continuous_backup_enabled', environment = 'production',
        'snapshot_backup_enabled', true,
        'backup_retention_days', CASE environment
          WHEN 'production' THEN 7
          WHEN 'staging' THEN 3
          WHEN 'development' THEN 1
        END
      ),
      'security_config', JSON_BUILD_OBJECT(
        'encryption_at_rest', environment IN ('production', 'staging'),
        'network_access_restricted', true,
        'database_auditing', environment = 'production',
        'ldap_authentication', environment = 'production'
      )
    ) as infrastructure_spec,

    -- Deployment configuration
    JSON_BUILD_OBJECT(
      'deployment_strategy', 'rolling',
      'approval_required', environment = 'production',
      'automated_testing', true,
      'rollback_on_failure', true,
      'notification_channels', ARRAY[
        'email:ops-team@company.com',
        'slack:#database-ops'
      ]
    ) as deployment_config,

    -- Monitoring configuration
    JSON_BUILD_OBJECT(
      'performance_monitoring', true,
      'custom_alerts', ARRAY[
        JSON_BUILD_OBJECT(
          'metric', 'CONNECTIONS_PERCENT',
          'threshold', 80,
          'comparison', 'GREATER_THAN'
        ),
        JSON_BUILD_OBJECT(
          'metric', 'NORMALIZED_SYSTEM_CPU_USER',
          'threshold', 75,
          'comparison', 'GREATER_THAN'
        ),
        JSON_BUILD_OBJECT(
          'metric', 'DISK_PARTITION_SPACE_USED_DATA',
          'threshold', 85,
          'comparison', 'GREATER_THAN'
        )
      ],
      'notification_delay_minutes', 5,
      'auto_scaling_triggers', JSON_BUILD_OBJECT(
        'cpu_threshold_percent', 75,
        'memory_threshold_percent', 80,
        'connections_threshold_percent', 80
      )
    ) as monitoring_config

  FROM (
    VALUES 
      ('web-app-cluster', '1.0', 'production'),
      ('web-app-cluster', '1.0', 'staging'),
      ('web-app-cluster', '1.0', 'development'),
      ('analytics-cluster', '1.0', 'production'),
      ('reporting-cluster', '1.0', 'production')
  ) as templates(template_name, template_version, environment)
),

deployment_validation AS (
  SELECT 
    dt.*,

    -- Cost estimation
    CASE dt.environment
      WHEN 'production' THEN 
        CASE 
          WHEN dt.infrastructure_spec->'cluster_config'->>'instance_size' = 'M30' THEN 590
          WHEN dt.infrastructure_spec->'cluster_config'->>'instance_size' = 'M40' THEN 940
          WHEN dt.infrastructure_spec->'cluster_config'->>'instance_size' = 'M80' THEN 2350
          ELSE 300
        END
      WHEN 'staging' THEN 
        CASE 
          WHEN dt.infrastructure_spec->'cluster_config'->>'instance_size' = 'M20' THEN 350
          WHEN dt.infrastructure_spec->'cluster_config'->>'instance_size' = 'M40' THEN 940
          ELSE 200
        END
      ELSE 57  -- Development M10
    END as estimated_monthly_cost_usd,

    -- Compliance validation
    CASE 
      WHEN dt.environment = 'production' AND 
           (dt.infrastructure_spec->'security_config'->>'encryption_at_rest')::BOOLEAN = false THEN 'encryption_required'
      WHEN dt.environment = 'production' AND 
           (dt.infrastructure_spec->'backup_config'->>'continuous_backup_enabled')::BOOLEAN = false THEN 'continuous_backup_required'
      WHEN (dt.infrastructure_spec->'security_config'->>'network_access_restricted')::BOOLEAN = false THEN 'network_security_required'
      ELSE 'compliant'
    END as compliance_status,

    -- Resource sizing validation
    CASE 
      WHEN dt.environment = 'production' AND 
           dt.infrastructure_spec->'cluster_config'->>'instance_size' < 'M30' THEN 'undersized_for_production'
      WHEN dt.environment = 'development' AND 
           dt.infrastructure_spec->'cluster_config'->>'instance_size' > 'M20' THEN 'oversized_for_development'
      ELSE 'appropriately_sized'
    END as sizing_validation,

    -- Deployment readiness
    CASE 
      WHEN dt.infrastructure_spec IS NULL THEN 'missing_infrastructure_spec'
      WHEN dt.deployment_config IS NULL THEN 'missing_deployment_config'
      WHEN dt.monitoring_config IS NULL THEN 'missing_monitoring_config'
      ELSE 'ready_for_deployment'
    END as deployment_readiness

  FROM deployment_templates dt
)

SELECT 
  dv.template_name,
  dv.environment,
  dv.template_version,

  -- Infrastructure summary
  dv.infrastructure_spec->'cluster_config'->>'instance_size' as instance_size,
  dv.infrastructure_spec->'cluster_config'->>'mongodb_version' as mongodb_version,
  (dv.infrastructure_spec->'cluster_config'->>'replication_factor')::INTEGER as replication_factor,

  -- Auto-scaling configuration
  (dv.infrastructure_spec->'auto_scaling'->>'compute_enabled')::BOOLEAN as auto_scaling_enabled,
  dv.infrastructure_spec->'auto_scaling'->>'min_instance_size' as min_instance_size,
  dv.infrastructure_spec->'auto_scaling'->>'max_instance_size' as max_instance_size,

  -- Security and backup
  (dv.infrastructure_spec->'security_config'->>'encryption_at_rest')::BOOLEAN as encryption_enabled,
  (dv.infrastructure_spec->'backup_config'->>'continuous_backup_enabled')::BOOLEAN as continuous_backup,
  (dv.infrastructure_spec->'backup_config'->>'backup_retention_days')::INTEGER as backup_retention_days,

  -- Cost and validation
  dv.estimated_monthly_cost_usd,
  dv.compliance_status,
  dv.sizing_validation,
  dv.deployment_readiness,

  -- Alert configuration count
  JSON_ARRAY_LENGTH(dv.monitoring_config->'custom_alerts') as custom_alert_count,

  -- Deployment recommendations
  ARRAY[
    CASE WHEN dv.compliance_status != 'compliant' THEN 'Fix compliance issues before deployment' END,
    CASE WHEN dv.sizing_validation LIKE '%undersized%' THEN 'Increase instance size for production workload' END,
    CASE WHEN dv.sizing_validation LIKE '%oversized%' THEN 'Consider smaller instance size to reduce costs' END,
    CASE WHEN dv.estimated_monthly_cost_usd > 1000 AND dv.environment != 'production' 
         THEN 'Review cost allocation for non-production environment' END,
    CASE WHEN JSON_ARRAY_LENGTH(dv.monitoring_config->'custom_alerts') < 3 
         THEN 'Add more comprehensive monitoring alerts' END
  ]::TEXT[] as deployment_recommendations,

  -- Deployment priority
  CASE 
    WHEN dv.deployment_readiness != 'ready_for_deployment' THEN 'blocked'
    WHEN dv.compliance_status != 'compliant' THEN 'compliance_review_required'
    WHEN dv.environment = 'production' THEN 'high_priority'
    WHEN dv.environment = 'staging' THEN 'medium_priority'
    ELSE 'low_priority'
  END as deployment_priority,

  -- Terraform generation command
  CASE 
    WHEN dv.deployment_readiness = 'ready_for_deployment' THEN
      FORMAT('terraform apply -var="environment=%s" -var="instance_size=%s" -target=mongodbatlas_cluster.%s_%s',
             dv.environment,
             dv.infrastructure_spec->'cluster_config'->>'instance_size',
             dv.template_name,
             dv.environment)
    ELSE 'Fix validation issues first'
  END as terraform_command

FROM deployment_validation dv
ORDER BY 
  CASE dv.deployment_priority
    WHEN 'blocked' THEN 1
    WHEN 'compliance_review_required' THEN 2
    WHEN 'high_priority' THEN 3
    WHEN 'medium_priority' THEN 4
    ELSE 5
  END,
  dv.template_name,
  dv.environment;

-- QueryLeaf provides comprehensive MongoDB Atlas infrastructure capabilities:
-- 1. Automated cluster deployment with infrastructure-as-code integration
-- 2. Advanced performance monitoring and intelligent auto-scaling
-- 3. Cost optimization and resource right-sizing recommendations
-- 4. Security and compliance automation with policy enforcement
-- 5. DevOps pipeline integration for continuous deployment
-- 6. Multi-cloud deployment and disaster recovery capabilities
-- 7. SQL-familiar syntax for complex Atlas infrastructure operations
-- 8. Enterprise-grade monitoring, alerting, and operational excellence
-- 9. Terraform and CI/CD integration for automated infrastructure management
-- 10. Production-ready Atlas operations with comprehensive automation

Best Practices for Production Atlas Deployments

Infrastructure Architecture and Automation Strategy

Essential principles for effective MongoDB Atlas production deployment:

  1. Infrastructure-as-Code: Implement comprehensive infrastructure-as-code with version control, testing, and automated deployment pipelines
  2. Auto-Scaling Configuration: Design intelligent auto-scaling policies based on application patterns and performance requirements (see the policy sketch after this list)
  3. Security Integration: Implement advanced security controls, network isolation, and encryption at rest and in transit
  4. Monitoring Strategy: Configure comprehensive monitoring, alerting, and performance optimization for proactive management
  5. Disaster Recovery: Design multi-region backup strategies and automated disaster recovery procedures
  6. Cost Optimization: Implement continuous cost monitoring and automated resource optimization based on utilization patterns
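
A minimal sketch of the auto-scaling policy idea from point 2, using CPU and memory thresholds similar to the 75% / 80% values in the monitoring examples earlier. The threshold numbers and the scaleUp/scaleDown/hold labels are illustrative assumptions, not Atlas defaults.

// Derive a scaling decision from observed utilization metrics
function deriveScalingDecision(metrics, thresholds = { cpuHigh: 75, memHigh: 80, cpuLow: 30, memLow: 50 }) {
  const { cpuPercent, memoryPercent } = metrics;

  if (cpuPercent > thresholds.cpuHigh && memoryPercent > thresholds.memHigh) {
    return { action: 'scaleUp', reason: 'sustained CPU and memory pressure' };
  }
  if (cpuPercent < thresholds.cpuLow && memoryPercent < thresholds.memLow) {
    return { action: 'scaleDown', reason: 'cluster appears overprovisioned' };
  }
  return { action: 'hold', reason: 'utilization within target band' };
}

// deriveScalingDecision({ cpuPercent: 82, memoryPercent: 86 })
//   -> { action: 'scaleUp', reason: 'sustained CPU and memory pressure' }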

DevOps Integration and Production Operations

Optimize Atlas operations for enterprise-scale DevOps workflows:

  1. CI/CD Integration: Build comprehensive deployment pipelines with automated testing, approval workflows, and rollback capabilities (see the gating sketch after this list)
  2. Environment Management: Design consistent environment promotion strategies with appropriate resource sizing and security controls
  3. Performance Monitoring: Implement intelligent performance monitoring with predictive scaling and optimization recommendations
  4. Compliance Automation: Ensure automated compliance monitoring and policy enforcement for regulatory requirements
  5. Operational Excellence: Design automated operational procedures for maintenance, scaling, and incident response
  6. Cost Management: Monitor cloud spending patterns and implement automated cost optimization strategies
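
A small sketch of the approval gating mentioned in point 1, mirroring the approval_required, rollback_on_failure, and notification_channels flags from the deployment_config example earlier. The environment rules and defaults are illustrative assumptions.

// Decide how a pipeline should treat an Atlas infrastructure change
function evaluateDeploymentGate(environment, deploymentConfig = {}) {
  const approvalRequired = deploymentConfig.approval_required ?? environment === 'production';

  return {
    environment,
    requiresManualApproval: approvalRequired,
    autoApply: !approvalRequired, // staging and development changes can apply automatically
    rollbackOnFailure: deploymentConfig.rollback_on_failure ?? true,
    notify: deploymentConfig.notification_channels || ['email:ops-team@company.com', 'slack:#database-ops']
  };
}

// evaluateDeploymentGate('production')
//   -> { requiresManualApproval: true, autoApply: false, rollbackOnFailure: true, ... }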

Conclusion

MongoDB Atlas provides comprehensive cloud database infrastructure automation that enables sophisticated DevOps integration, intelligent scaling, and enterprise-grade operational capabilities through infrastructure-as-code, automated monitoring, and advanced security features. The Atlas platform ensures that cloud database operations benefit from MongoDB's managed service expertise while providing the flexibility and control needed for production applications.

Key MongoDB Atlas benefits include:

  • Infrastructure Automation: Complete infrastructure-as-code integration with automated provisioning, scaling, and lifecycle management
  • Intelligent Operations: AI-powered performance optimization, predictive scaling, and automated operational recommendations
  • DevOps Integration: Seamless CI/CD pipeline integration with automated testing, deployment, and rollback capabilities
  • Enterprise Security: Advanced security controls, compliance automation, and built-in best practices for production environments
  • Cost Optimization: Intelligent resource management and automated cost optimization based on actual usage patterns
  • SQL Accessibility: Familiar SQL-style Atlas operations through QueryLeaf for accessible cloud database management

Whether you're building cloud-native applications, implementing DevOps automation, managing multi-environment deployments, or optimizing database operations at scale, MongoDB Atlas with QueryLeaf's familiar SQL interface provides the foundation for sophisticated, automated cloud database infrastructure.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB Atlas operations while providing SQL-familiar syntax for infrastructure management, monitoring, and automation. Advanced Atlas features, DevOps integration, and operational automation are seamlessly handled through familiar SQL constructs, making sophisticated cloud database operations accessible to SQL-oriented infrastructure teams.

The combination of MongoDB Atlas's robust cloud capabilities with SQL-style infrastructure operations makes it an ideal platform for applications requiring both automated database operations and familiar infrastructure management patterns, ensuring your cloud database infrastructure can scale efficiently while maintaining operational excellence and cost optimization as application complexity and usage grow.

MongoDB Backup and Recovery Strategies: Advanced Disaster Recovery and Data Protection for Mission-Critical Applications

Production database environments require robust backup and recovery strategies that can protect against data loss, system failures, and disaster scenarios while enabling rapid recovery with minimal business disruption. Traditional backup approaches often struggle with large database sizes, complex recovery procedures, and inconsistent backup scheduling, leading to extended recovery times, potential data loss, and operational complexity that can compromise business continuity during critical incidents.

MongoDB provides comprehensive backup and recovery capabilities through native backup tools, automated backup scheduling, incremental backup strategies, and point-in-time recovery features that ensure robust data protection with minimal performance impact. Unlike traditional databases that require complex backup scripting and manual recovery procedures, MongoDB integrates backup and recovery operations directly into the database with optimized backup compression, automatic consistency verification, and streamlined recovery workflows.
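
As a quick illustration, a point-in-time-capable backup can be taken with the standard mongodump tool using its --archive, --gzip, and --oplog options; the connection string and output path below are placeholders, and a complete, automated implementation is developed later in this section.

// Minimal mongodump invocation from Node.js; --oplog captures operations that
// occur during the dump so the archive restores to a consistent point in time
const { spawn } = require('child_process');

function runMongodump(uri, archivePath) {
  return new Promise((resolve, reject) => {
    const dump = spawn('mongodump', ['--uri', uri, `--archive=${archivePath}`, '--gzip', '--oplog']);

    dump.on('error', reject);
    dump.on('close', code =>
      code === 0 ? resolve(archivePath) : reject(new Error(`mongodump exited with code ${code}`))
    );
  });
}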

The Traditional Backup and Recovery Challenge

Conventional database backup approaches face significant limitations in enterprise environments:

-- Traditional PostgreSQL backup management - manual processes with limited automation capabilities

-- Basic backup tracking table with minimal functionality
CREATE TABLE backup_jobs (
    backup_id SERIAL PRIMARY KEY,
    backup_name VARCHAR(255) NOT NULL,
    backup_type VARCHAR(100) NOT NULL, -- full, incremental, differential
    database_name VARCHAR(100) NOT NULL,

    -- Backup execution tracking
    backup_start_time TIMESTAMP NOT NULL,
    backup_end_time TIMESTAMP,
    backup_status VARCHAR(50) DEFAULT 'running',

    -- Basic size and performance metrics (limited visibility)
    backup_size_bytes BIGINT,
    backup_duration_seconds INTEGER,
    backup_compression_ratio DECIMAL(5,2),

    -- File location tracking (manual)
    backup_file_path TEXT,
    backup_storage_location VARCHAR(200),
    backup_retention_days INTEGER DEFAULT 30,

    -- Basic validation (very limited)
    backup_checksum VARCHAR(64),
    backup_verification_status VARCHAR(50),
    backup_verification_time TIMESTAMP,

    -- Error tracking
    backup_error_message TEXT,
    backup_warning_count INTEGER DEFAULT 0,

    -- Metadata
    created_by VARCHAR(100) DEFAULT current_user,
    backup_method VARCHAR(100) DEFAULT 'pg_dump'
);

-- Simple backup scheduling table (no real automation)
CREATE TABLE backup_schedules (
    schedule_id SERIAL PRIMARY KEY,
    schedule_name VARCHAR(255) NOT NULL,
    database_name VARCHAR(100) NOT NULL,
    backup_type VARCHAR(100) NOT NULL,

    -- Basic scheduling (cron-like but manual)
    schedule_frequency VARCHAR(50), -- daily, weekly, monthly
    schedule_time TIME,
    schedule_days VARCHAR(20), -- comma-separated day numbers

    -- Basic configuration
    retention_days INTEGER DEFAULT 30,
    backup_location VARCHAR(200),
    compression_enabled BOOLEAN DEFAULT true,

    -- Status tracking
    schedule_enabled BOOLEAN DEFAULT true,
    last_backup_time TIMESTAMP,
    last_backup_status VARCHAR(50),
    next_backup_time TIMESTAMP,

    -- Error tracking
    consecutive_failures INTEGER DEFAULT 0,
    last_error_message TEXT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Manual backup execution function (very basic functionality)
CREATE OR REPLACE FUNCTION execute_backup(
    database_name_param VARCHAR(100),
    backup_type_param VARCHAR(100) DEFAULT 'full'
) RETURNS TABLE (
    backup_id INTEGER,
    backup_status VARCHAR(50),
    backup_duration_seconds INTEGER,
    backup_size_mb INTEGER,
    backup_file_path TEXT,
    error_message TEXT
) AS $$
DECLARE
    new_backup_id INTEGER;
    backup_start TIMESTAMP;
    backup_end TIMESTAMP;
    backup_command TEXT;
    backup_filename TEXT;
    backup_directory TEXT := '/backup/postgresql/';
    command_result INTEGER;
    backup_size BIGINT;
    final_status VARCHAR(50) := 'completed';
    error_msg TEXT := '';
BEGIN
    backup_start := clock_timestamp();

    -- Generate backup filename
    backup_filename := database_name_param || '_' || 
                      backup_type_param || '_' || 
                      TO_CHAR(backup_start, 'YYYY-MM-DD_HH24-MI-SS') || '.sql';

    -- Create backup job record
    INSERT INTO backup_jobs (
        backup_name, backup_type, database_name, 
        backup_start_time, backup_file_path, backup_method
    )
    VALUES (
        backup_filename, backup_type_param, database_name_param,
        backup_start, backup_directory || backup_filename, 'pg_dump'
    )
    RETURNING backup_jobs.backup_id INTO new_backup_id;

    BEGIN
        -- Execute backup command (this is a simulation - real implementation would call external command)
        -- In reality: pg_dump -h localhost -U postgres -d database_name -f backup_file

        -- Simulate backup process with basic validation
        IF database_name_param NOT IN (SELECT datname FROM pg_database) THEN
            RAISE EXCEPTION 'Database % does not exist', database_name_param;
        END IF;

        -- Simulate backup time based on type
        CASE backup_type_param
            WHEN 'full' THEN PERFORM pg_sleep(2.0);  -- Simulate 2 seconds for full backup
            WHEN 'incremental' THEN PERFORM pg_sleep(0.5);  -- Simulate 0.5 seconds for incremental
            ELSE PERFORM pg_sleep(1.0);
        END CASE;

        -- Simulate backup size calculation (very basic)
        SELECT pg_database_size(database_name_param) INTO backup_size;

        -- Basic compression simulation
        backup_size := backup_size * 0.3;  -- Assume 70% compression

    EXCEPTION WHEN OTHERS THEN
        final_status := 'failed';
        error_msg := SQLERRM;
        backup_size := 0;
    END;

    backup_end := clock_timestamp();

    -- Update backup job record
    UPDATE backup_jobs 
    SET 
        backup_end_time = backup_end,
        backup_status = final_status,
        backup_size_bytes = backup_size,
        backup_duration_seconds = EXTRACT(EPOCH FROM backup_end - backup_start)::INTEGER,
        backup_compression_ratio = CASE WHEN backup_size > 0 THEN 70.0 ELSE 0 END,
        backup_error_message = CASE WHEN final_status = 'failed' THEN error_msg ELSE NULL END
    WHERE backup_jobs.backup_id = new_backup_id;

    -- Return results
    RETURN QUERY SELECT 
        new_backup_id,
        final_status,
        EXTRACT(EPOCH FROM backup_end - backup_start)::INTEGER,
        (backup_size / 1024 / 1024)::INTEGER,
        backup_directory || backup_filename,
        CASE WHEN final_status = 'failed' THEN error_msg ELSE NULL END;

END;
$$ LANGUAGE plpgsql;

-- Execute a backup (basic functionality)
SELECT * FROM execute_backup('production_db', 'full');

-- Basic backup verification function (very limited)
CREATE OR REPLACE FUNCTION verify_backup(backup_id_param INTEGER)
RETURNS TABLE (
    backup_id INTEGER,
    verification_status VARCHAR(50),
    verification_duration_seconds INTEGER,
    file_exists BOOLEAN,
    file_size_mb INTEGER,
    checksum_valid BOOLEAN,
    error_message TEXT
) AS $$
DECLARE
    backup_record RECORD;
    verification_start TIMESTAMP;
    verification_end TIMESTAMP;
    file_size BIGINT;
    verification_error TEXT := '';
    verification_result VARCHAR(50) := 'valid';
BEGIN
    verification_start := clock_timestamp();

    -- Get backup record
    SELECT * INTO backup_record
    FROM backup_jobs
    WHERE backup_jobs.backup_id = backup_id_param;

    IF NOT FOUND THEN
        RETURN QUERY SELECT 
            backup_id_param,
            'not_found'::VARCHAR(50),
            0,
            false,
            0,
            false,
            'Backup record not found'::TEXT;
        RETURN;
    END IF;

    BEGIN
        -- Simulate file verification (in reality would check actual file)
        -- Check if backup was successful
        IF backup_record.backup_status != 'completed' THEN
            verification_result := 'invalid';
            verification_error := 'Original backup failed';
        END IF;

        -- Simulate file size check
        file_size := backup_record.backup_size_bytes;

        -- Basic integrity simulation
        IF file_size = 0 OR backup_record.backup_duration_seconds = 0 THEN
            verification_result := 'invalid';
            verification_error := 'Backup file appears to be empty or corrupted';
        END IF;

        -- Simulate verification time
        PERFORM pg_sleep(0.1);

    EXCEPTION WHEN OTHERS THEN
        verification_result := 'error';
        verification_error := SQLERRM;
    END;

    verification_end := clock_timestamp();

    -- Update backup record with verification results
    UPDATE backup_jobs
    SET 
        backup_verification_status = verification_result,
        backup_verification_time = verification_end
    WHERE backup_jobs.backup_id = backup_id_param;

    -- Return verification results
    RETURN QUERY SELECT 
        backup_id_param,
        verification_result,
        EXTRACT(EPOCH FROM verification_end - verification_start)::INTEGER,
        CASE WHEN file_size > 0 THEN true ELSE false END,
        (file_size / 1024 / 1024)::INTEGER,
        CASE WHEN verification_result = 'valid' THEN true ELSE false END,
        CASE WHEN verification_result != 'valid' THEN verification_error ELSE NULL END;

END;
$$ LANGUAGE plpgsql;

-- Recovery function (very basic and manual)
CREATE OR REPLACE FUNCTION restore_backup(
    backup_id_param INTEGER,
    target_database_name VARCHAR(100)
) RETURNS TABLE (
    restore_success BOOLEAN,
    restore_duration_seconds INTEGER,
    restored_size_mb INTEGER,
    error_message TEXT
) AS $$
DECLARE
    backup_record RECORD;
    restore_start TIMESTAMP;
    restore_end TIMESTAMP;
    restore_error TEXT := '';
    restore_result BOOLEAN := true;
BEGIN
    restore_start := clock_timestamp();

    -- Get backup information
    SELECT * INTO backup_record
    FROM backup_jobs
    WHERE backup_id = backup_id_param
    AND backup_status = 'completed';

    IF NOT FOUND THEN
        RETURN QUERY SELECT 
            false,
            0,
            0,
            'Valid backup not found for restore operation'::TEXT;
        RETURN;
    END IF;

    BEGIN
        -- Simulate restore process (in reality would execute psql command)
        -- psql -h localhost -U postgres -d target_database -f backup_file

        -- Basic validation
        IF target_database_name IS NULL OR LENGTH(target_database_name) = 0 THEN
            RAISE EXCEPTION 'Target database name is required';
        END IF;

        -- Simulate restore time proportional to backup size
        PERFORM pg_sleep(LEAST(backup_record.backup_duration_seconds * 1.5, 10.0));

    EXCEPTION WHEN OTHERS THEN
        restore_result := false;
        restore_error := SQLERRM;
    END;

    restore_end := clock_timestamp();

    -- Return restore results
    RETURN QUERY SELECT 
        restore_result,
        EXTRACT(EPOCH FROM restore_end - restore_start)::INTEGER,
        (backup_record.backup_size_bytes / 1024 / 1024)::INTEGER,
        CASE WHEN NOT restore_result THEN restore_error ELSE NULL END;

END;
$$ LANGUAGE plpgsql;

-- Basic backup monitoring and cleanup
WITH backup_status_summary AS (
    SELECT 
        DATE_TRUNC('day', backup_start_time) as backup_date,
        database_name,
        backup_type,
        COUNT(*) as total_backups,
        COUNT(*) FILTER (WHERE backup_status = 'completed') as successful_backups,
        COUNT(*) FILTER (WHERE backup_status = 'failed') as failed_backups,
        SUM(backup_size_bytes) as total_backup_size_bytes,
        AVG(backup_duration_seconds) as avg_backup_duration,
        MIN(backup_start_time) as first_backup,
        MAX(backup_start_time) as last_backup

    FROM backup_jobs
    WHERE backup_start_time >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY DATE_TRUNC('day', backup_start_time), database_name, backup_type
)
SELECT 
    backup_date,
    database_name,
    backup_type,
    total_backups,
    successful_backups,
    failed_backups,

    -- Success rate
    CASE 
        WHEN total_backups > 0 THEN
            ROUND((successful_backups::DECIMAL / total_backups) * 100, 1)
        ELSE 0
    END as success_rate_percent,

    -- Size and performance metrics
    ROUND((total_backup_size_bytes / 1024.0 / 1024.0), 1) as total_size_mb,
    ROUND(avg_backup_duration::NUMERIC, 1) as avg_duration_seconds,

    -- Backup frequency analysis
    (EXTRACT(EPOCH FROM (last_backup - first_backup)) / 3600)::INTEGER as backup_window_hours,

    -- Health assessment
    CASE 
        WHEN failed_backups > 0 THEN 'issues'
        WHEN successful_backups = 0 THEN 'no_backups'
        ELSE 'healthy'
    END as backup_health,

    -- Recommendations
    CASE 
        WHEN failed_backups > total_backups * 0.2 THEN 'investigate_failures'
        WHEN avg_backup_duration > 3600 THEN 'optimize_performance'
        WHEN total_backup_size_bytes > 100 * 1024 * 1024 * 1024 THEN 'consider_compression'
        ELSE 'monitor'
    END as recommendation

FROM backup_status_summary
ORDER BY backup_date DESC, database_name, backup_type;

-- Cleanup old backups (manual process)
WITH old_backups AS (
    SELECT backup_id, backup_file_path, backup_size_bytes
    FROM backup_jobs
    WHERE backup_start_time < CURRENT_DATE - INTERVAL '90 days'
    AND backup_status = 'completed'
),
cleanup_summary AS (
    DELETE FROM backup_jobs
    WHERE backup_id IN (SELECT backup_id FROM old_backups)
    RETURNING backup_id, backup_size_bytes
)
SELECT 
    COUNT(*) as backups_cleaned,
    SUM(backup_size_bytes) as total_space_freed_bytes,
    ROUND(SUM(backup_size_bytes) / 1024.0 / 1024.0 / 1024.0, 2) as space_freed_gb
FROM cleanup_summary;

-- Problems with traditional backup approaches:
-- 1. Manual backup execution with no automation or scheduling
-- 2. Limited backup verification and integrity checking
-- 3. No point-in-time recovery capabilities
-- 4. Basic error handling with no automatic retry mechanisms
-- 5. No incremental backup support or optimization
-- 6. Manual cleanup and retention management
-- 7. Limited monitoring and alerting capabilities
-- 8. No support for distributed backup strategies
-- 9. Complex recovery procedures requiring manual intervention
-- 10. No integration with cloud storage or disaster recovery systems

MongoDB provides comprehensive backup and recovery capabilities with automated scheduling and management:

// MongoDB Advanced Backup and Recovery - comprehensive data protection with automated disaster recovery
const { MongoClient, GridFSBucket } = require('mongodb');
const { spawn } = require('child_process');
const fs = require('fs').promises;
const path = require('path');
const { createHash } = require('crypto');
const { EventEmitter } = require('events');

// Comprehensive MongoDB Backup and Recovery Manager
class AdvancedBackupRecoveryManager extends EventEmitter {
  constructor(connectionString, backupConfig = {}) {
    super();
    this.connectionString = connectionString;
    this.client = null;
    this.db = null;

    // Advanced backup and recovery configuration
    this.config = {
      // Backup strategy configuration
      enableAutomatedBackups: backupConfig.enableAutomatedBackups !== false,
      enableIncrementalBackups: backupConfig.enableIncrementalBackups || false,
      enablePointInTimeRecovery: backupConfig.enablePointInTimeRecovery || false,
      enableCompression: backupConfig.enableCompression !== false,

      // Backup scheduling
      fullBackupSchedule: backupConfig.fullBackupSchedule || '0 2 * * *', // Daily at 2 AM
      incrementalBackupSchedule: backupConfig.incrementalBackupSchedule || '0 */6 * * *', // Every 6 hours

      // Storage configuration
      backupStoragePath: backupConfig.backupStoragePath || './backups',
      maxBackupSize: backupConfig.maxBackupSize || 10 * 1024 * 1024 * 1024, // 10GB
      compressionLevel: backupConfig.compressionLevel || 6,

      // Retention policies
      dailyBackupRetention: backupConfig.dailyBackupRetention || 30, // 30 days
      weeklyBackupRetention: backupConfig.weeklyBackupRetention || 12, // 12 weeks
      monthlyBackupRetention: backupConfig.monthlyBackupRetention || 12, // 12 months

      // Backup validation
      enableBackupVerification: backupConfig.enableBackupVerification !== false,
      verificationSampleSize: backupConfig.verificationSampleSize || 1000,
      enableChecksumValidation: backupConfig.enableChecksumValidation !== false,

      // Recovery configuration
      enableParallelRecovery: backupConfig.enableParallelRecovery || false,
      maxRecoveryThreads: backupConfig.maxRecoveryThreads || 4,
      recoveryBatchSize: backupConfig.recoveryBatchSize || 1000,

      // Monitoring and alerting
      enableBackupMonitoring: backupConfig.enableBackupMonitoring !== false,
      enableRecoveryTesting: backupConfig.enableRecoveryTesting || false,
      alertThresholds: {
        backupFailureCount: backupConfig.backupFailureThreshold || 3,
        backupDurationMinutes: backupConfig.backupDurationThreshold || 120,
        backupSizeVariation: backupConfig.backupSizeVariationThreshold || 50
      },

      // Disaster recovery
      enableReplication: backupConfig.enableReplication || false,
      replicationTargets: backupConfig.replicationTargets || [],
      enableCloudSync: backupConfig.enableCloudSync || false,
      cloudSyncConfig: backupConfig.cloudSyncConfig || {}
    };

    // Backup and recovery state management
    this.backupJobs = new Map();
    this.scheduledBackups = new Map();
    this.recoveryOperations = new Map();
    this.backupMetrics = {
      totalBackups: 0,
      successfulBackups: 0,
      failedBackups: 0,
      totalDataBackedUp: 0,
      averageBackupDuration: 0
    };

    // Backup history and metadata
    this.backupHistory = [];
    this.recoveryHistory = [];

    this.initializeBackupSystem();
  }

  async initializeBackupSystem() {
    console.log('Initializing advanced backup and recovery system...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.connectionString);
      await this.client.connect();
      this.db = this.client.db();

      // Setup backup infrastructure
      await this.setupBackupInfrastructure();

      // Initialize automated backup scheduling
      if (this.config.enableAutomatedBackups) {
        await this.setupAutomatedBackups();
      }

      // Setup backup monitoring
      if (this.config.enableBackupMonitoring) {
        await this.setupBackupMonitoring();
      }

      // Initialize point-in-time recovery if enabled
      if (this.config.enablePointInTimeRecovery) {
        await this.setupPointInTimeRecovery();
      }

      console.log('Advanced backup and recovery system initialized successfully');

    } catch (error) {
      console.error('Error initializing backup system:', error);
      throw error;
    }
  }

  async setupBackupInfrastructure() {
    console.log('Setting up backup infrastructure...');

    try {
      // Create backup storage directory
      await fs.mkdir(this.config.backupStoragePath, { recursive: true });

      // Create subdirectories for different backup types
      const backupDirs = ['full', 'incremental', 'logs', 'metadata', 'recovery-points'];
      for (const dir of backupDirs) {
        await fs.mkdir(path.join(this.config.backupStoragePath, dir), { recursive: true });
      }

      // Setup backup metadata collections
      const collections = {
        backupJobs: this.db.collection('backup_jobs'),
        backupMetadata: this.db.collection('backup_metadata'),
        recoveryOperations: this.db.collection('recovery_operations'),
        backupSchedules: this.db.collection('backup_schedules')
      };

      // Create indexes for backup operations
      await collections.backupJobs.createIndex(
        { startTime: -1, status: 1 },
        { background: true }
      );

      await collections.backupMetadata.createIndex(
        { backupId: 1, backupType: 1, timestamp: -1 },
        { background: true }
      );

      await collections.recoveryOperations.createIndex(
        { recoveryId: 1, startTime: -1 },
        { background: true }
      );

      this.collections = collections;

    } catch (error) {
      console.error('Error setting up backup infrastructure:', error);
      throw error;
    }
  }

  async createFullBackup(backupOptions = {}) {
    console.log('Starting full database backup...');

    const backupId = this.generateBackupId('full');
    const startTime = new Date();

    try {
      // Create backup job record
      const backupJob = {
        backupId: backupId,
        backupType: 'full',
        startTime: startTime,
        status: 'running',

        // Backup configuration
        options: {
          compression: this.config.enableCompression,
          compressionLevel: this.config.compressionLevel,
          includeIndexes: backupOptions.includeIndexes !== false,
          includeSystemCollections: backupOptions.includeSystemCollections || false,
          oplogCapture: this.config.enablePointInTimeRecovery
        },

        // Progress tracking
        progress: {
          collectionsProcessed: 0,
          totalCollections: 0,
          documentsProcessed: 0,
          totalDocuments: 0,
          bytesProcessed: 0,
          estimatedTotalBytes: 0
        },

        // Performance metrics
        performance: {
          throughputMBps: 0,
          compressionRatio: 0,
          parallelStreams: 1
        }
      };

      await this.collections.backupJobs.insertOne(backupJob);
      this.backupJobs.set(backupId, backupJob);

      // Get database statistics for progress tracking
      const dbStats = await this.db.stats();
      backupJob.progress.estimatedTotalBytes = dbStats.dataSize;

      // Get collection list and metadata
      const collections = await this.db.listCollections().toArray();
      backupJob.progress.totalCollections = collections.length;

      // Calculate total document count across collections
      let totalDocuments = 0;
      for (const collectionInfo of collections) {
        if (collectionInfo.type === 'collection') {
          const collection = this.db.collection(collectionInfo.name);
          const count = await collection.estimatedDocumentCount();
          totalDocuments += count;
        }
      }
      backupJob.progress.totalDocuments = totalDocuments;

      // Create backup using mongodump
      const backupResult = await this.executeMongoDump(backupId, backupJob);

      // Verify backup integrity
      if (this.config.enableBackupVerification) {
        await this.verifyBackupIntegrity(backupId, backupResult);
      }

      // Calculate backup metrics
      const endTime = new Date();
      const duration = endTime.getTime() - startTime.getTime();
      const backupSizeBytes = backupResult.backupSize;
      const compressionRatio = backupResult.originalSize > 0 ? 
        (backupResult.originalSize - backupSizeBytes) / backupResult.originalSize : 0;

      // Update backup job with results
      const completedJob = {
        ...backupJob,
        endTime: endTime,
        status: 'completed',
        duration: duration,
        backupSize: backupSizeBytes,
        originalSize: backupResult.originalSize,
        compressionRatio: compressionRatio,
        backupPath: backupResult.backupPath,
        checksum: backupResult.checksum,

        // Final performance metrics
        performance: {
          throughputMBps: (backupSizeBytes / 1024 / 1024) / (duration / 1000),
          compressionRatio: compressionRatio,
          parallelStreams: backupResult.parallelStreams || 1
        }
      };

      await this.collections.backupJobs.replaceOne(
        { backupId: backupId },
        completedJob
      );

      // Update backup metrics
      this.updateBackupMetrics(completedJob);

      // Store backup metadata for recovery operations
      await this.storeBackupMetadata(completedJob);

      this.emit('backupCompleted', {
        backupId: backupId,
        backupType: 'full',
        duration: duration,
        backupSize: backupSizeBytes,
        compressionRatio: compressionRatio
      });

      console.log(`Full backup completed: ${backupId} (${Math.round(backupSizeBytes / 1024 / 1024)} MB, ${Math.round(duration / 1000)}s)`);

      return {
        success: true,
        backupId: backupId,
        backupSize: backupSizeBytes,
        duration: duration,
        compressionRatio: compressionRatio,
        backupPath: backupResult.backupPath
      };

    } catch (error) {
      console.error(`Full backup failed for ${backupId}:`, error);

      // Update backup job with error
      await this.collections.backupJobs.updateOne(
        { backupId: backupId },
        {
          $set: {
            status: 'failed',
            endTime: new Date(),
            error: {
              message: error.message,
              stack: error.stack,
              timestamp: new Date()
            }
          }
        }
      );

      this.backupMetrics.failedBackups++;

      this.emit('backupFailed', {
        backupId: backupId,
        backupType: 'full',
        error: error.message
      });

      return {
        success: false,
        backupId: backupId,
        error: error.message
      };
    }
  }

  async executeMongoDump(backupId, backupJob) {
    console.log(`Executing mongodump for backup: ${backupId}`);

    return new Promise((resolve, reject) => {
      const backupPath = path.join(
        this.config.backupStoragePath,
        'full',
        `${backupId}.archive`
      );

      // Build mongodump command arguments
      const mongodumpArgs = [
        '--uri', this.connectionString,
        '--archive=' + backupPath,
        '--gzip'
      ];

      // Add additional options based on configuration
      if (backupJob.options.oplogCapture) {
        mongodumpArgs.push('--oplog');
      }

      if (!backupJob.options.includeSystemCollections) {
        // mongodump does not support wildcards; exclude system collections by prefix
        mongodumpArgs.push('--excludeCollectionsWithPrefix=system.');
      }

      // Execute mongodump
      const mongodumpProcess = spawn('mongodump', mongodumpArgs);

      let stdoutData = '';
      let stderrData = '';

      mongodumpProcess.stdout.on('data', (data) => {
        stdoutData += data.toString();
        this.parseBackupProgress(backupId, data.toString());
      });

      mongodumpProcess.stderr.on('data', (data) => {
        stderrData += data.toString();
        console.warn('mongodump stderr:', data.toString());
      });

      mongodumpProcess.on('close', async (code) => {
        try {
          if (code === 0) {
            // Get backup file statistics
            const stats = await fs.stat(backupPath);
            const backupSize = stats.size;

            // Calculate checksum for integrity verification
            const checksum = await this.calculateFileChecksum(backupPath);

            resolve({
              backupPath: backupPath,
              backupSize: backupSize,
              originalSize: backupJob.progress.estimatedTotalBytes,
              checksum: checksum,
              stdout: stdoutData,
              parallelStreams: 1
            });
          } else {
            reject(new Error(`mongodump failed with exit code ${code}: ${stderrData}`));
          }
        } catch (error) {
          reject(error);
        }
      });

      mongodumpProcess.on('error', (error) => {
        reject(new Error(`Failed to start mongodump: ${error.message}`));
      });
    });
  }

  parseBackupProgress(backupId, output) {
    // Parse mongodump output to extract progress information
    const backupJob = this.backupJobs.get(backupId);
    if (!backupJob) return;

    // Look for progress indicators in mongodump output
    const progressMatches = output.match(/(\d+)\s+documents?\s+to\s+(\w+)\.(\w+)/g);
    if (progressMatches) {
      for (const match of progressMatches) {
        const [, docCount, dbName, collectionName] = match.match(/(\d+)\s+documents?\s+to\s+(\w+)\.(\w+)/);

        backupJob.progress.documentsProcessed += parseInt(docCount);
        backupJob.progress.collectionsProcessed++;

        // Emit progress update
        this.emit('backupProgress', {
          backupId: backupId,
          progress: {
            collectionsProcessed: backupJob.progress.collectionsProcessed,
            totalCollections: backupJob.progress.totalCollections,
            documentsProcessed: backupJob.progress.documentsProcessed,
            totalDocuments: backupJob.progress.totalDocuments,
            percentComplete: (backupJob.progress.documentsProcessed / backupJob.progress.totalDocuments) * 100
          }
        });
      }
    }
  }

  async calculateFileChecksum(filePath) {
    console.log(`Calculating checksum for: ${filePath}`);

    try {
      const fileBuffer = await fs.readFile(filePath);
      const hash = createHash('sha256');
      hash.update(fileBuffer);
      return hash.digest('hex');

    } catch (error) {
      console.error('Error calculating file checksum:', error);
      throw error;
    }
  }

  async verifyBackupIntegrity(backupId, backupResult) {
    console.log(`Verifying backup integrity: ${backupId}`);

    try {
      const verification = {
        backupId: backupId,
        verificationTime: new Date(),
        checksumVerified: false,
        sampleVerified: false,
        errors: []
      };

      // Verify file checksum
      const currentChecksum = await this.calculateFileChecksum(backupResult.backupPath);
      verification.checksumVerified = currentChecksum === backupResult.checksum;

      if (!verification.checksumVerified) {
        verification.errors.push('Checksum verification failed - file may be corrupted');
      }

      // Perform sample restore verification
      if (this.config.verificationSampleSize > 0) {
        const sampleResult = await this.performSampleRestoreTest(backupId, backupResult);
        verification.sampleVerified = sampleResult.success;

        if (!sampleResult.success) {
          verification.errors.push(`Sample restore failed: ${sampleResult.error}`);
        }
      }

      // Store verification results
      await this.collections.backupMetadata.updateOne(
        { backupId: backupId },
        {
          $set: {
            verification: verification,
            lastVerificationTime: verification.verificationTime
          }
        },
        { upsert: true }
      );

      this.emit('backupVerified', {
        backupId: backupId,
        verification: verification
      });

      return verification;

    } catch (error) {
      console.error(`Backup verification failed for ${backupId}:`, error);
      throw error;
    }
  }

  async performSampleRestoreTest(backupId, backupResult) {
    console.log(`Performing sample restore test for backup: ${backupId}`);

    try {
      // Create temporary database for restore test
      const testDbName = `backup_test_${backupId}_${Date.now()}`;

      // Execute mongorestore on sample data
      const restoreResult = await this.executeSampleRestore(
        backupResult.backupPath,
        testDbName
      );

      // Verify restored data integrity
      const verificationResult = await this.verifySampleData(testDbName);

      // Cleanup test database
      await this.cleanupTestDatabase(testDbName);

      return {
        success: restoreResult.success && verificationResult.success,
        error: restoreResult.error || verificationResult.error,
        restoredDocuments: restoreResult.documentCount,
        verificationDetails: verificationResult
      };

    } catch (error) {
      console.error(`Sample restore test failed for ${backupId}:`, error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async createIncrementalBackup(baseBackupId, backupOptions = {}) {
    console.log(`Starting incremental backup based on: ${baseBackupId}`);

    const backupId = this.generateBackupId('incremental');
    const startTime = new Date();

    try {
      // Get base backup metadata
      const baseBackup = await this.collections.backupJobs.findOne({ backupId: baseBackupId });
      if (!baseBackup) {
        throw new Error(`Base backup not found: ${baseBackupId}`);
      }

      // Create incremental backup job record
      const backupJob = {
        backupId: backupId,
        backupType: 'incremental',
        baseBackupId: baseBackupId,
        startTime: startTime,
        status: 'running',

        // Incremental backup specific configuration
        options: {
          ...backupOptions,
          fromTimestamp: baseBackup.endTime,
          toTimestamp: startTime,
          oplogOnly: true,
          compression: this.config.enableCompression
        },

        progress: {
          oplogEntriesProcessed: 0,
          totalOplogEntries: 0,
          bytesProcessed: 0
        }
      };

      await this.collections.backupJobs.insertOne(backupJob);
      this.backupJobs.set(backupId, backupJob);

      // Execute incremental backup using oplog
      const backupResult = await this.executeOplogBackup(backupId, backupJob);

      // Update backup job with results
      const endTime = new Date();
      const duration = endTime.getTime() - startTime.getTime();

      const completedJob = {
        ...backupJob,
        endTime: endTime,
        status: 'completed',
        duration: duration,
        backupSize: backupResult.backupSize,
        oplogEntries: backupResult.oplogEntries,
        backupPath: backupResult.backupPath,
        checksum: backupResult.checksum
      };

      await this.collections.backupJobs.replaceOne(
        { backupId: backupId },
        completedJob
      );

      this.updateBackupMetrics(completedJob);
      await this.storeBackupMetadata(completedJob);

      this.emit('backupCompleted', {
        backupId: backupId,
        backupType: 'incremental',
        baseBackupId: baseBackupId,
        duration: duration,
        backupSize: backupResult.backupSize,
        oplogEntries: backupResult.oplogEntries
      });

      console.log(`Incremental backup completed: ${backupId}`);

      return {
        success: true,
        backupId: backupId,
        baseBackupId: baseBackupId,
        backupSize: backupResult.backupSize,
        duration: duration,
        oplogEntries: backupResult.oplogEntries
      };

    } catch (error) {
      console.error(`Incremental backup failed for ${backupId}:`, error);

      await this.collections.backupJobs.updateOne(
        { backupId: backupId },
        {
          $set: {
            status: 'failed',
            endTime: new Date(),
            error: {
              message: error.message,
              stack: error.stack,
              timestamp: new Date()
            }
          }
        }
      );

      return {
        success: false,
        backupId: backupId,
        error: error.message
      };
    }
  }

  async restoreFromBackup(backupId, restoreOptions = {}) {
    console.log(`Starting database restore from backup: ${backupId}`);

    const recoveryId = this.generateRecoveryId();
    const startTime = new Date();

    try {
      // Get backup metadata
      const backupJob = await this.collections.backupJobs.findOne({ backupId: backupId });
      if (!backupJob || backupJob.status !== 'completed') {
        throw new Error(`Valid backup not found: ${backupId}`);
      }

      // Create recovery operation record
      const recoveryOperation = {
        recoveryId: recoveryId,
        backupId: backupId,
        backupType: backupJob.backupType,
        startTime: startTime,
        status: 'running',

        // Recovery configuration
        options: {
          targetDatabase: restoreOptions.targetDatabase || this.db.databaseName,
          dropBeforeRestore: restoreOptions.dropBeforeRestore || false,
          restoreIndexes: restoreOptions.restoreIndexes !== false,
          parallelRecovery: this.config.enableParallelRecovery,
          batchSize: this.config.recoveryBatchSize
        },

        progress: {
          collectionsRestored: 0,
          totalCollections: 0,
          documentsRestored: 0,
          totalDocuments: 0,
          bytesRestored: 0
        }
      };

      await this.collections.recoveryOperations.insertOne(recoveryOperation);
      this.recoveryOperations.set(recoveryId, recoveryOperation);

      // Execute restore process
      const restoreResult = await this.executeRestore(recoveryId, backupJob, recoveryOperation);

      // Verify restore integrity
      if (this.config.enableBackupVerification) {
        await this.verifyRestoreIntegrity(recoveryId, restoreResult);
      }

      // Update recovery operation with results
      const endTime = new Date();
      const duration = endTime.getTime() - startTime.getTime();

      const completedRecovery = {
        ...recoveryOperation,
        endTime: endTime,
        status: 'completed',
        duration: duration,
        restoredSize: restoreResult.restoredSize,
        documentsRestored: restoreResult.documentsRestored,
        collectionsRestored: restoreResult.collectionsRestored
      };

      await this.collections.recoveryOperations.replaceOne(
        { recoveryId: recoveryId },
        completedRecovery
      );

      this.recoveryHistory.push(completedRecovery);

      this.emit('restoreCompleted', {
        recoveryId: recoveryId,
        backupId: backupId,
        duration: duration,
        restoredSize: restoreResult.restoredSize,
        documentsRestored: restoreResult.documentsRestored
      });

      console.log(`Database restore completed: ${recoveryId}`);

      return {
        success: true,
        recoveryId: recoveryId,
        backupId: backupId,
        duration: duration,
        restoredSize: restoreResult.restoredSize,
        documentsRestored: restoreResult.documentsRestored,
        collectionsRestored: restoreResult.collectionsRestored
      };

    } catch (error) {
      console.error(`Database restore failed for ${recoveryId}:`, error);

      await this.collections.recoveryOperations.updateOne(
        { recoveryId: recoveryId },
        {
          $set: {
            status: 'failed',
            endTime: new Date(),
            error: {
              message: error.message,
              stack: error.stack,
              timestamp: new Date()
            }
          }
        }
      );

      return {
        success: false,
        recoveryId: recoveryId,
        backupId: backupId,
        error: error.message
      };
    }
  }

  async getBackupStatus(backupId = null) {
    console.log(`Getting backup status${backupId ? ' for: ' + backupId : ' (all backups)'}`);

    try {
      let query = {};
      if (backupId) {
        query.backupId = backupId;
      }

      const backups = await this.collections.backupJobs
        .find(query)
        .sort({ startTime: -1 })
        .limit(backupId ? 1 : 50)
        .toArray();

      const backupStatuses = backups.map(backup => ({
        backupId: backup.backupId,
        backupType: backup.backupType,
        status: backup.status,
        startTime: backup.startTime,
        endTime: backup.endTime,
        duration: backup.duration,
        backupSize: backup.backupSize,
        compressionRatio: backup.compressionRatio,
        documentsProcessed: backup.progress?.documentsProcessed || 0,
        collectionsProcessed: backup.progress?.collectionsProcessed || 0,
        error: backup.error?.message || null,

        // Additional metadata
        baseBackupId: backup.baseBackupId || null,
        checksum: backup.checksum || null,
        backupPath: backup.backupPath || null,

        // Performance metrics
        throughputMBps: backup.performance?.throughputMBps || 0,

        // Health indicators
        healthStatus: this.assessBackupHealth(backup),
        lastVerificationTime: backup.verification?.verificationTime || null,
        verificationStatus: backup.verification
          ? (backup.verification.checksumVerified ? 'verified' : 'failed')
          : 'pending'
      }));

      return {
        success: true,
        backups: backupStatuses,
        totalBackups: backups.length,

        // System-wide metrics
        systemMetrics: {
          totalBackups: this.backupMetrics.totalBackups,
          successfulBackups: this.backupMetrics.successfulBackups,
          failedBackups: this.backupMetrics.failedBackups,
          averageBackupDuration: this.backupMetrics.averageBackupDuration,
          totalDataBackedUp: this.backupMetrics.totalDataBackedUp
        }
      };

    } catch (error) {
      console.error('Error getting backup status:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  assessBackupHealth(backup) {
    if (backup.status === 'failed') return 'unhealthy';
    if (backup.status === 'running') return 'in_progress';
    if (backup.status !== 'completed') return 'unknown';

    // Check verification status
    if (backup.verification && !backup.verification.checksumVerified) {
      return 'verification_failed';
    }

    // Check backup age
    const ageHours = (Date.now() - backup.startTime.getTime()) / (1000 * 60 * 60);
    if (ageHours > 24 * 7) return 'stale'; // Older than 1 week

    return 'healthy';
  }

  updateBackupMetrics(backupJob) {
    this.backupMetrics.totalBackups++;

    if (backupJob.status === 'completed') {
      this.backupMetrics.successfulBackups++;
      this.backupMetrics.totalDataBackedUp += backupJob.backupSize || 0;

      // Update average duration
      const currentAvg = this.backupMetrics.averageBackupDuration;
      const totalSuccessful = this.backupMetrics.successfulBackups;
      this.backupMetrics.averageBackupDuration = 
        ((currentAvg * (totalSuccessful - 1)) + (backupJob.duration || 0)) / totalSuccessful;
    } else if (backupJob.status === 'failed') {
      this.backupMetrics.failedBackups++;
    }
  }

  async storeBackupMetadata(backupJob) {
    const metadata = {
      backupId: backupJob.backupId,
      backupType: backupJob.backupType,
      timestamp: backupJob.startTime,
      backupSize: backupJob.backupSize,
      backupPath: backupJob.backupPath,
      checksum: backupJob.checksum,
      compressionRatio: backupJob.compressionRatio,
      baseBackupId: backupJob.baseBackupId || null,

      // Retention information
      retentionPolicy: this.determineRetentionPolicy(backupJob),
      expirationDate: this.calculateExpirationDate(backupJob),

      // Recovery information
      recoveryMetadata: {
        documentsCount: backupJob.progress?.documentsProcessed || 0,
        collectionsCount: backupJob.progress?.collectionsProcessed || 0,
        indexesIncluded: backupJob.options?.includeIndexes !== false,
        oplogIncluded: backupJob.options?.oplogCapture === true
      }
    };

    await this.collections.backupMetadata.replaceOne(
      { backupId: backupJob.backupId },
      metadata,
      { upsert: true }
    );
  }

  determineRetentionPolicy(backupJob) {
    const dayOfWeek = backupJob.startTime.getDay();
    const dayOfMonth = backupJob.startTime.getDate();

    if (dayOfMonth === 1) return 'monthly';
    if (dayOfWeek === 0) return 'weekly'; // Sunday
    return 'daily';
  }

  calculateExpirationDate(backupJob) {
    const retentionPolicy = this.determineRetentionPolicy(backupJob);
    const startTime = backupJob.startTime;

    switch (retentionPolicy) {
      case 'monthly':
        return new Date(startTime.getTime() + (this.config.monthlyBackupRetention * 30 * 24 * 60 * 60 * 1000));
      case 'weekly':
        return new Date(startTime.getTime() + (this.config.weeklyBackupRetention * 7 * 24 * 60 * 60 * 1000));
      default:
        return new Date(startTime.getTime() + (this.config.dailyBackupRetention * 24 * 60 * 60 * 1000));
    }
  }

  generateBackupId(type) {
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
    return `backup_${type}_${timestamp}_${Math.random().toString(36).slice(2, 11)}`;
  }

  generateRecoveryId() {
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
    return `recovery_${timestamp}_${Math.random().toString(36).slice(2, 11)}`;
  }

  async shutdown() {
    console.log('Shutting down backup and recovery manager...');

    try {
      // Stop all scheduled backups
      for (const [scheduleId, schedule] of this.scheduledBackups.entries()) {
        clearInterval(schedule.interval);
      }

      // Wait for active backup jobs to complete
      for (const [backupId, backupJob] of this.backupJobs.entries()) {
        if (backupJob.status === 'running') {
          console.log(`Waiting for backup to complete: ${backupId}`);
          // In a real implementation, we would wait for or gracefully cancel the backup
        }
      }

      // Close MongoDB connection
      if (this.client) {
        await this.client.close();
      }

      console.log('Backup and recovery manager shutdown complete');

    } catch (error) {
      console.error('Error during shutdown:', error);
    }
  }

  // Additional methods would include implementations for:
  // - setupAutomatedBackups()
  // - setupBackupMonitoring() 
  // - setupPointInTimeRecovery()
  // - executeOplogBackup()
  // - executeRestore()
  // - executeSampleRestore()
  // - verifySampleData()
  // - cleanupTestDatabase()
  // - verifyRestoreIntegrity()
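
  // As one illustration, executeOplogBackup() could read the oplog entries
  // written since the base backup finished and persist them alongside a
  // checksum. This is a simplified sketch only: it assumes a replica set
  // member (so local.oplog.rs is available), assumes a backupStoragePath
  // config value, and buffers entries in memory rather than streaming them.
  async executeOplogBackup(backupId, backupJob) {
    const { Timestamp } = require('mongodb');

    const oplog = this.client.db('local').collection('oplog.rs');
    const fromTs = new Timestamp({
      t: Math.floor(backupJob.options.fromTimestamp.getTime() / 1000),
      i: 0
    });

    // Collect entries newer than the base backup's end time
    const entries = await oplog
      .find({ ts: { $gt: fromTs } })
      .sort({ $natural: 1 })
      .toArray();

    const backupPath = `${this.config.backupStoragePath}/${backupId}.oplog.json`;
    await fs.writeFile(backupPath, JSON.stringify(entries));

    const checksum = await this.calculateFileChecksum(backupPath);
    const backupSize = (await fs.stat(backupPath)).size;

    return {
      backupPath: backupPath,
      backupSize: backupSize,
      oplogEntries: entries.length,
      checksum: checksum
    };
  }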
}

// Benefits of MongoDB Advanced Backup and Recovery:
// - Automated backup scheduling with flexible retention policies
// - Comprehensive backup verification and integrity checking
// - Point-in-time recovery capabilities with oplog integration
// - Incremental backup support for efficient storage utilization
// - Advanced compression and optimization for large databases
// - Parallel backup and recovery operations for improved performance
// - Comprehensive monitoring and alerting for backup operations
// - Disaster recovery capabilities with replication and cloud sync
// - SQL-compatible backup management through QueryLeaf integration
// - Production-ready backup automation with minimal configuration

module.exports = {
  AdvancedBackupRecoveryManager
};
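
For context, a minimal usage sketch of the manager above is shown below. The connection string, module path, and configuration values are placeholders, and only methods defined in the class above are used; in practice you would also wait for the manager's asynchronous initialization to finish before issuing calls.

const { AdvancedBackupRecoveryManager } = require('./advanced-backup-recovery-manager');

async function restoreLatestVerifiedBackup() {
  const manager = new AdvancedBackupRecoveryManager('mongodb://localhost:27017/production', {
    enableBackupVerification: true,
    verificationSampleSize: 1000
  });

  // List recent backups and pick the newest completed, verified one
  const status = await manager.getBackupStatus();
  const candidate = status.backups.find(
    (b) => b.status === 'completed' && b.verificationStatus === 'verified'
  );

  if (candidate) {
    const result = await manager.restoreFromBackup(candidate.backupId, {
      targetDatabase: 'production_staging',
      dropBeforeRestore: true
    });
    console.log(result.success
      ? `Restored ${result.documentsRestored} documents from ${candidate.backupId}`
      : `Restore failed: ${result.error}`);
  }

  await manager.shutdown();
}

restoreLatestVerifiedBackup().catch(console.error);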

Understanding MongoDB Backup and Recovery Architecture

Advanced Backup Strategy Design and Implementation Patterns

Implement comprehensive backup and recovery workflows for enterprise MongoDB deployments:

// Enterprise-grade MongoDB backup and recovery with advanced disaster recovery capabilities
class EnterpriseBackupStrategy extends AdvancedBackupRecoveryManager {
  constructor(connectionString, enterpriseConfig) {
    super(connectionString, enterpriseConfig);

    this.enterpriseConfig = {
      ...enterpriseConfig,
      enableGeographicReplication: true,
      enableComplianceAuditing: true,
      enableAutomatedTesting: true,
      enableDisasterRecoveryProcedures: true,
      enableCapacityPlanning: true
    };

    this.setupEnterpriseBackupStrategy();
    this.initializeDisasterRecoveryProcedures();
    this.setupComplianceAuditing();
  }

  async implementAdvancedBackupStrategy() {
    console.log('Implementing enterprise backup strategy...');

    const backupStrategy = {
      // Multi-tier backup strategy
      backupTiers: {
        primaryBackups: {
          frequency: 'daily',
          retentionDays: 30,
          compressionLevel: 9,
          verificationLevel: 'full'
        },
        secondaryBackups: {
          frequency: 'hourly',
          retentionDays: 7,
          compressionLevel: 6,
          verificationLevel: 'checksum'
        },
        archivalBackups: {
          frequency: 'monthly',
          retentionMonths: 84, // 7 years for compliance
          compressionLevel: 9,
          verificationLevel: 'full'
        }
      },

      // Disaster recovery configuration
      disasterRecovery: {
        geographicReplication: true,
        crossRegionBackups: true,
        automatedFailoverTesting: true,
        recoveryTimeObjective: 4 * 60 * 60 * 1000, // 4 hours
        recoveryPointObjective: 15 * 60 * 1000 // 15 minutes
      },

      // Performance optimization
      performanceOptimization: {
        parallelBackupStreams: 8,
        networkOptimization: true,
        storageOptimization: true,
        resourceThrottling: true
      }
    };

    return await this.deployEnterpriseStrategy(backupStrategy);
  }

  async setupComplianceAuditing() {
    console.log('Setting up compliance auditing for backup operations...');

    const auditingConfig = {
      // Regulatory compliance
      regulations: ['SOX', 'GDPR', 'HIPAA', 'PCI-DSS'],
      auditTrailRetention: 7 * 365, // 7 years
      encryptionStandards: ['AES-256', 'RSA-2048'],
      accessControlAuditing: true,

      // Data governance
      dataClassification: {
        sensitiveDataHandling: true,
        dataRetentionPolicies: true,
        dataLineageTracking: true,
        privacyCompliance: true
      }
    };

    return await this.deployComplianceFramework(auditingConfig);
  }
}
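
A brief usage sketch for the enterprise strategy class follows; the connection string and option values are illustrative only.

const strategy = new EnterpriseBackupStrategy('mongodb://prod-cluster:27017/production', {
  enableGeographicReplication: true,
  enableComplianceAuditing: true
});

// Deploy the multi-tier backup strategy and compliance framework defined above
strategy.implementAdvancedBackupStrategy()
  .then((deployment) => console.log('Enterprise backup strategy deployed:', deployment))
  .catch((error) => console.error('Strategy deployment failed:', error));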

SQL-Style Backup and Recovery with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB backup and recovery operations:

-- QueryLeaf advanced backup and recovery with SQL-familiar syntax for MongoDB

-- Configure comprehensive backup strategy
CONFIGURE BACKUP_STRATEGY 
SET strategy_name = 'enterprise_backup',
    backup_types = ['full', 'incremental', 'differential'],

    -- Full backup configuration
    full_backup_schedule = '0 2 * * 0',  -- Weekly on Sunday at 2 AM
    full_backup_retention_days = 90,
    full_backup_compression_level = 9,

    -- Incremental backup configuration  
    incremental_backup_schedule = '0 */6 * * *',  -- Every 6 hours
    incremental_backup_retention_days = 14,
    incremental_backup_compression_level = 6,

    -- Point-in-time recovery
    enable_point_in_time_recovery = true,
    oplog_retention_hours = 168,  -- 7 days
    recovery_point_objective_minutes = 15,
    recovery_time_objective_hours = 4,

    -- Storage and performance
    backup_storage_path = '/backup/mongodb',
    enable_compression = true,
    enable_encryption = true,
    parallel_backup_streams = 8,
    max_backup_bandwidth_mbps = 1000,

    -- Verification and validation
    enable_backup_verification = true,
    verification_sample_size = 10000,
    enable_checksum_validation = true,
    enable_restore_testing = true,

    -- Disaster recovery
    enable_geographic_replication = true,
    cross_region_backup_locations = ['us-east-1', 'eu-west-1'],
    enable_automated_failover_testing = true,

    -- Monitoring and alerting
    enable_backup_monitoring = true,
    alert_on_backup_failure = true,
    alert_on_backup_delay_minutes = 60,
    alert_on_verification_failure = true;

-- Execute comprehensive backup with monitoring
WITH backup_execution AS (
  SELECT 
    backup_id,
    backup_type,
    backup_start_time,
    backup_end_time,
    backup_status,
    backup_size_bytes,
    compression_ratio,

    -- Performance metrics
    EXTRACT(EPOCH FROM (backup_end_time - backup_start_time)) as backup_duration_seconds,
    CASE 
      WHEN EXTRACT(EPOCH FROM (backup_end_time - backup_start_time)) > 0 THEN
        (backup_size_bytes / 1024.0 / 1024.0) / EXTRACT(EPOCH FROM (backup_end_time - backup_start_time))
      ELSE 0
    END as throughput_mbps,

    -- Progress tracking
    collections_processed,
    total_collections,
    documents_processed,
    total_documents,
    CASE 
      WHEN total_documents > 0 THEN 
        (documents_processed * 100.0) / total_documents
      ELSE 0
    END as completion_percentage,

    -- Quality metrics
    backup_checksum,
    verification_status,
    verification_timestamp,

    -- Storage efficiency
    original_size_bytes,
    CASE 
      WHEN original_size_bytes > 0 THEN
        ((original_size_bytes - backup_size_bytes) * 100.0) / original_size_bytes
      ELSE 0
    END as compression_percentage,

    -- Error tracking
    error_message,
    warning_count,
    retry_count

  FROM BACKUP_JOBS('full', 'production_db')
  WHERE backup_start_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
),

performance_analysis AS (
  SELECT 
    backup_type,
    COUNT(*) as total_backups,
    COUNT(*) FILTER (WHERE backup_status = 'completed') as successful_backups,
    COUNT(*) FILTER (WHERE backup_status = 'failed') as failed_backups,
    COUNT(*) FILTER (WHERE backup_status = 'running') as in_progress_backups,

    -- Performance statistics
    AVG(backup_duration_seconds) as avg_duration_seconds,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY backup_duration_seconds) as p95_duration_seconds,
    AVG(throughput_mbps) as avg_throughput_mbps,
    MAX(throughput_mbps) as max_throughput_mbps,

    -- Size and compression analysis
    SUM(backup_size_bytes) as total_backup_size_bytes,
    AVG(compression_percentage) as avg_compression_percentage,

    -- Quality metrics
    COUNT(*) FILTER (WHERE verification_status = 'verified') as verified_backups,
    COUNT(*) FILTER (WHERE error_message IS NOT NULL) as backups_with_errors,
    AVG(warning_count) as avg_warnings_per_backup,

    -- Success rate calculations
    CASE 
      WHEN COUNT(*) > 0 THEN
        (COUNT(*) FILTER (WHERE backup_status = 'completed') * 100.0) / COUNT(*)
      ELSE 0
    END as success_rate_percentage,

    -- Recent trends
    COUNT(*) FILTER (
      WHERE backup_start_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
      AND backup_status = 'completed'
    ) as successful_backups_last_week

  FROM backup_execution
  GROUP BY backup_type
),

storage_analysis AS (
  SELECT 
    DATE_TRUNC('day', backup_start_time) as backup_date,
    SUM(backup_size_bytes) as daily_backup_size_bytes,
    COUNT(*) as daily_backup_count,
    AVG(compression_ratio) as avg_daily_compression_ratio,

    -- Growth analysis
    LAG(SUM(backup_size_bytes)) OVER (
      ORDER BY DATE_TRUNC('day', backup_start_time)
    ) as prev_day_backup_size,

    -- Storage efficiency
    SUM(original_size_bytes - backup_size_bytes) as daily_space_saved_bytes,

    -- Quality indicators
    COUNT(*) FILTER (WHERE verification_status = 'verified') as verified_backups_per_day,
    COUNT(*) FILTER (WHERE backup_status = 'failed') as failed_backups_per_day

  FROM backup_execution
  WHERE backup_start_time >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY DATE_TRUNC('day', backup_start_time)
)

SELECT 
  pa.backup_type,
  pa.total_backups,
  pa.successful_backups,
  pa.failed_backups,
  pa.in_progress_backups,

  -- Performance summary
  ROUND(pa.avg_duration_seconds, 1) as avg_backup_time_seconds,
  ROUND(pa.p95_duration_seconds, 1) as p95_backup_time_seconds,
  ROUND(pa.avg_throughput_mbps, 2) as avg_throughput_mbps,
  ROUND(pa.max_throughput_mbps, 2) as max_throughput_mbps,

  -- Storage summary
  ROUND(pa.total_backup_size_bytes / 1024.0 / 1024.0 / 1024.0, 2) as total_backup_size_gb,
  ROUND(pa.avg_compression_percentage, 1) as avg_compression_percent,

  -- Quality assessment
  pa.verified_backups,
  ROUND((pa.verified_backups * 100.0) / NULLIF(pa.successful_backups, 0), 1) as verification_rate_percent,
  pa.success_rate_percentage,

  -- Health indicators
  CASE 
    WHEN pa.success_rate_percentage < 95 THEN 'critical'
    WHEN pa.success_rate_percentage < 98 THEN 'warning'
    WHEN pa.avg_duration_seconds > 7200 THEN 'warning'  -- 2 hours
    ELSE 'healthy'
  END as backup_health_status,

  -- Operational recommendations
  CASE 
    WHEN pa.failed_backups > pa.total_backups * 0.05 THEN 'investigate_failures'
    WHEN pa.avg_duration_seconds > 3600 THEN 'optimize_performance'
    WHEN pa.avg_compression_percentage < 50 THEN 'review_compression_settings'
    WHEN pa.verified_backups < pa.successful_backups * 0.9 THEN 'improve_verification_coverage'
    ELSE 'monitor_continued'
  END as recommendation,

  -- Recent activity
  pa.successful_backups_last_week,
  CASE 
    WHEN pa.successful_backups_last_week < 7 AND pa.backup_type = 'full' THEN 'backup_frequency_low'
    WHEN pa.successful_backups_last_week < 28 AND pa.backup_type = 'incremental' THEN 'backup_frequency_low'
    ELSE 'backup_frequency_adequate'
  END as frequency_assessment,

  -- Storage trends from storage_analysis
  (SELECT 
     ROUND(AVG(sa.daily_backup_size_bytes) / 1024.0 / 1024.0, 1) 
   FROM storage_analysis sa 
   WHERE sa.backup_date >= CURRENT_DATE - INTERVAL '7 days'
  ) as avg_daily_backup_size_mb,

  (SELECT 
     ROUND(SUM(sa.daily_space_saved_bytes) / 1024.0 / 1024.0 / 1024.0, 2) 
   FROM storage_analysis sa 
   WHERE sa.backup_date >= CURRENT_DATE - INTERVAL '30 days'
  ) as total_space_saved_last_month_gb

FROM performance_analysis pa
ORDER BY pa.backup_type;

-- Point-in-time recovery analysis and recommendations
WITH recovery_scenarios AS (
  SELECT 
    recovery_id,
    backup_id,
    recovery_type,
    target_timestamp,
    recovery_start_time,
    recovery_end_time,
    recovery_status,

    -- Recovery performance
    EXTRACT(EPOCH FROM (recovery_end_time - recovery_start_time)) as recovery_duration_seconds,
    documents_restored,
    collections_restored,
    restored_data_size_bytes,

    -- Recovery quality
    data_consistency_verified,
    index_rebuild_required,
    post_recovery_validation_status,

    -- Business impact
    downtime_seconds,
    affected_applications,
    recovery_point_achieved,
    recovery_time_objective_met,

    -- Error tracking
    recovery_errors,
    manual_intervention_required

  FROM RECOVERY_OPERATIONS
  WHERE recovery_start_time >= CURRENT_TIMESTAMP - INTERVAL '90 days'
),

recovery_performance AS (
  SELECT 
    recovery_type,
    COUNT(*) as total_recoveries,
    COUNT(*) FILTER (WHERE recovery_status = 'completed') as successful_recoveries,
    COUNT(*) FILTER (WHERE recovery_status = 'failed') as failed_recoveries,

    -- Performance metrics
    AVG(recovery_duration_seconds) as avg_recovery_time_seconds,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY recovery_duration_seconds) as p95_recovery_time_seconds,
    AVG(downtime_seconds) as avg_downtime_seconds,

    -- Data recovery metrics
    SUM(documents_restored) as total_documents_recovered,
    AVG(restored_data_size_bytes) as avg_data_size_recovered,

    -- Quality metrics
    COUNT(*) FILTER (WHERE data_consistency_verified = true) as verified_recoveries,
    COUNT(*) FILTER (WHERE recovery_time_objective_met = true) as rto_met_count,
    COUNT(*) FILTER (WHERE manual_intervention_required = true) as manual_intervention_count,

    -- Success rate
    CASE 
      WHEN COUNT(*) > 0 THEN
        (COUNT(*) FILTER (WHERE recovery_status = 'completed') * 100.0) / COUNT(*)
      ELSE 0
    END as recovery_success_rate_percent

  FROM recovery_scenarios
  GROUP BY recovery_type
),

backup_recovery_readiness AS (
  SELECT 
    backup_id,
    backup_type,
    backup_timestamp,
    backup_size_bytes,
    backup_status,
    verification_status,

    -- Recovery readiness assessment
    CASE 
      WHEN backup_status = 'completed' AND verification_status = 'verified' THEN 'ready'
      WHEN backup_status = 'completed' AND verification_status = 'pending' THEN 'needs_verification'
      WHEN backup_status = 'completed' AND verification_status = 'failed' THEN 'not_reliable'
      WHEN backup_status = 'failed' THEN 'not_available'
      ELSE 'unknown'
    END as recovery_readiness,

    -- Age assessment for recovery planning
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - backup_timestamp)) / 86400 as backup_age_days,
    CASE 
      WHEN EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - backup_timestamp)) / 86400 <= 1 THEN 'very_recent'
      WHEN EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - backup_timestamp)) / 86400 <= 7 THEN 'recent'
      WHEN EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - backup_timestamp)) / 86400 <= 30 THEN 'moderate'
      ELSE 'old'
    END as backup_age_category,

    -- Estimated recovery time based on size
    CASE 
      WHEN backup_size_bytes < 1024 * 1024 * 1024 THEN 'fast'      -- < 1GB
      WHEN backup_size_bytes < 10 * 1024 * 1024 * 1024 THEN 'moderate' -- < 10GB  
      WHEN backup_size_bytes < 100 * 1024 * 1024 * 1024 THEN 'slow'     -- < 100GB
      ELSE 'very_slow'                                                   -- >= 100GB
    END as estimated_recovery_speed

  FROM backup_jobs
  WHERE backup_timestamp >= CURRENT_TIMESTAMP - INTERVAL '30 days'
  AND backup_type IN ('full', 'incremental')
)

SELECT 
  rp.recovery_type,
  rp.total_recoveries,
  rp.successful_recoveries,
  rp.failed_recoveries,
  ROUND(rp.recovery_success_rate_percent, 1) as success_rate_percent,

  -- Performance summary
  ROUND(rp.avg_recovery_time_seconds / 60.0, 1) as avg_recovery_time_minutes,
  ROUND(rp.p95_recovery_time_seconds / 60.0, 1) as p95_recovery_time_minutes,
  ROUND(rp.avg_downtime_seconds / 60.0, 1) as avg_downtime_minutes,

  -- Data recovery summary  
  rp.total_documents_recovered,
  ROUND(rp.avg_data_size_recovered / 1024.0 / 1024.0, 1) as avg_data_recovered_mb,

  -- Quality assessment
  rp.verified_recoveries,
  ROUND((rp.verified_recoveries * 100.0) / NULLIF(rp.successful_recoveries, 0), 1) as verification_rate_percent,
  rp.rto_met_count,
  ROUND((rp.rto_met_count * 100.0) / NULLIF(rp.total_recoveries, 0), 1) as rto_achievement_percent,

  -- Operational indicators
  rp.manual_intervention_count,
  CASE 
    WHEN rp.recovery_success_rate_percent < 95 THEN 'critical'
    WHEN rp.avg_recovery_time_seconds > 14400 THEN 'warning'  -- 4 hours
    WHEN rp.manual_intervention_count > rp.total_recoveries * 0.2 THEN 'warning'
    ELSE 'healthy'
  END as recovery_health_status,

  -- Backup readiness summary
  (SELECT COUNT(*) 
   FROM backup_recovery_readiness brr 
   WHERE brr.recovery_readiness = 'ready' 
   AND brr.backup_age_category IN ('very_recent', 'recent')
  ) as ready_recent_backups,

  (SELECT COUNT(*) 
   FROM backup_recovery_readiness brr 
   WHERE brr.recovery_readiness = 'needs_verification'
  ) as backups_needing_verification,

  -- Recovery capability assessment
  CASE 
    WHEN rp.avg_recovery_time_seconds <= 3600 THEN 'excellent'  -- ≤ 1 hour
    WHEN rp.avg_recovery_time_seconds <= 14400 THEN 'good'      -- ≤ 4 hours  
    WHEN rp.avg_recovery_time_seconds <= 28800 THEN 'acceptable' -- ≤ 8 hours
    ELSE 'needs_improvement'
  END as recovery_capability_rating,

  -- Recommendations
  ARRAY[
    CASE WHEN rp.recovery_success_rate_percent < 98 THEN 'Improve backup verification processes' END,
    CASE WHEN rp.avg_recovery_time_seconds > 7200 THEN 'Optimize recovery performance' END,
    CASE WHEN rp.manual_intervention_count > 0 THEN 'Automate recovery procedures' END,
    CASE WHEN rp.rto_achievement_percent < 90 THEN 'Review recovery time objectives' END
  ]::TEXT[] as improvement_recommendations

FROM recovery_performance rp
ORDER BY rp.recovery_type;

-- Disaster recovery readiness assessment
CREATE VIEW disaster_recovery_dashboard AS
WITH current_backup_status AS (
  SELECT 
    backup_type,
    COUNT(*) as total_backups,
    COUNT(*) FILTER (WHERE backup_status = 'completed') as completed_backups,
    COUNT(*) FILTER (WHERE verification_status = 'verified') as verified_backups,
    MAX(backup_timestamp) as latest_backup_time,

    -- Recovery point assessment
    MIN(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - backup_timestamp)) / 60) as minutes_since_latest,

    -- Geographic distribution
    COUNT(DISTINCT backup_location) as backup_locations,
    COUNT(*) FILTER (WHERE backup_location LIKE '%cross-region%') as cross_region_backups

  FROM backup_jobs
  WHERE backup_timestamp >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  GROUP BY backup_type
),

disaster_scenarios AS (
  SELECT 
    scenario_name,
    scenario_type,
    estimated_data_loss_minutes,
    estimated_recovery_time_hours,
    recovery_success_probability,
    last_tested_date,
    test_result_status

  FROM disaster_recovery_tests
  WHERE test_date >= CURRENT_TIMESTAMP - INTERVAL '90 days'
),

compliance_status AS (
  SELECT 
    regulation_name,
    compliance_status,
    last_audit_date,
    next_audit_due_date,
    backup_retention_requirement_days,
    encryption_requirement_met,
    access_control_requirement_met

  FROM compliance_audits
  WHERE audit_type = 'backup_recovery'
)

SELECT 
  CURRENT_TIMESTAMP as dashboard_timestamp,

  -- Overall backup health
  (SELECT 
     CASE 
       WHEN MIN(minutes_since_latest) <= 60 AND 
            AVG((completed_backups * 100.0) / total_backups) >= 95 THEN 'excellent'
       WHEN MIN(minutes_since_latest) <= 240 AND 
            AVG((completed_backups * 100.0) / total_backups) >= 90 THEN 'good'  
       WHEN MIN(minutes_since_latest) <= 1440 AND 
            AVG((completed_backups * 100.0) / total_backups) >= 85 THEN 'acceptable'
       ELSE 'critical'
     END 
   FROM current_backup_status) as overall_backup_health,

  -- Recovery readiness
  (SELECT 
     CASE
       WHEN COUNT(*) FILTER (WHERE recovery_success_probability >= 0.95) = COUNT(*) THEN 'fully_ready'
       WHEN COUNT(*) FILTER (WHERE recovery_success_probability >= 0.90) >= COUNT(*) * 0.8 THEN 'mostly_ready' 
       WHEN COUNT(*) FILTER (WHERE recovery_success_probability >= 0.75) >= COUNT(*) * 0.6 THEN 'partially_ready'
       ELSE 'not_ready'
     END
   FROM disaster_scenarios) as disaster_recovery_readiness,

  -- Compliance status
  (SELECT 
     CASE 
       WHEN COUNT(*) FILTER (WHERE compliance_status = 'compliant') = COUNT(*) THEN 'fully_compliant'
       WHEN COUNT(*) FILTER (WHERE compliance_status = 'compliant') >= COUNT(*) * 0.8 THEN 'mostly_compliant'
       ELSE 'non_compliant'
     END
   FROM compliance_status) as regulatory_compliance_status,

  -- Detailed metrics
  (SELECT JSON_AGG(
     JSON_BUILD_OBJECT(
       'backup_type', backup_type,
       'completion_rate', ROUND((completed_backups * 100.0) / total_backups, 1),
       'verification_rate', ROUND((verified_backups * 100.0) / NULLIF(completed_backups, 0), 1),
       'minutes_since_latest', minutes_since_latest,
       'geographic_distribution', backup_locations,
       'cross_region_backups', cross_region_backups
     )
   ) FROM current_backup_status) as backup_status_details,

  -- Critical alerts
  ARRAY[
    CASE WHEN (SELECT MIN(minutes_since_latest) FROM current_backup_status) > 1440 
         THEN 'CRITICAL: No recent backups found (>24 hours)' END,
    CASE WHEN (SELECT COUNT(*) FROM disaster_scenarios WHERE last_tested_date < CURRENT_DATE - INTERVAL '90 days') > 0
         THEN 'WARNING: Disaster recovery procedures not recently tested' END,
    CASE WHEN (SELECT COUNT(*) FROM compliance_status WHERE compliance_status != 'compliant') > 0
         THEN 'WARNING: Compliance violations detected' END,
    CASE WHEN (SELECT AVG((verified_backups * 100.0) / NULLIF(completed_backups, 0)) FROM current_backup_status) < 90
         THEN 'WARNING: Low backup verification rate' END
  ]::TEXT[] as critical_alerts;

-- QueryLeaf provides comprehensive backup and recovery capabilities:
-- 1. SQL-familiar syntax for MongoDB backup configuration and management
-- 2. Advanced backup scheduling with flexible retention policies
-- 3. Comprehensive backup verification and integrity monitoring
-- 4. Point-in-time recovery capabilities with oplog integration
-- 5. Disaster recovery planning and readiness assessment
-- 6. Compliance auditing and regulatory requirement management
-- 7. Performance monitoring and optimization recommendations
-- 8. Automated backup testing and recovery validation
-- 9. Enterprise-grade backup management with minimal configuration
-- 10. Production-ready disaster recovery automation and procedures

Best Practices for Production Backup and Recovery

Backup Strategy Design Principles

Essential principles for effective MongoDB backup and recovery deployment:

  1. Multi-Tier Backup Strategy: Implement multiple backup frequencies and retention policies for different recovery scenarios (see the retention sketch after this list)
  2. Verification and Testing: Establish comprehensive backup verification and regular recovery testing procedures
  3. Point-in-Time Recovery: Configure oplog capture and incremental backups for granular recovery capabilities
  4. Geographic Distribution: Implement cross-region backup replication for disaster recovery protection
  5. Performance Optimization: Balance backup frequency with system performance impact through intelligent scheduling
  6. Compliance Integration: Ensure backup procedures meet regulatory requirements and audit standards
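
As a concrete starting point for the retention side of a multi-tier strategy, the sketch below mirrors the retention settings consumed by calculateExpirationDate() in the manager class above; the specific values are examples, not recommendations.

// Illustrative retention configuration for daily/weekly/monthly backup tiers
const retentionConfig = {
  dailyBackupRetention: 14,    // keep daily backups for 14 days
  weeklyBackupRetention: 8,    // keep Sunday (weekly) backups for 8 weeks
  monthlyBackupRetention: 12   // keep first-of-month backups for roughly a year
};

const manager = new AdvancedBackupRecoveryManager('mongodb://localhost:27017/production', {
  ...retentionConfig,
  enableCompression: true,
  enableBackupVerification: true
});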

Enterprise Backup Architecture

Design backup systems for enterprise-scale requirements:

  1. Automated Scheduling: Implement intelligent backup scheduling based on business requirements and system load
  2. Storage Management: Optimize backup storage with compression, deduplication, and lifecycle management (a lifecycle cleanup sketch follows this list)
  3. Monitoring Integration: Integrate backup monitoring with existing alerting and operational workflows
  4. Security Controls: Implement encryption, access controls, and audit trails for backup security
  5. Disaster Recovery: Design comprehensive disaster recovery procedures with automated failover capabilities
  6. Capacity Planning: Monitor backup growth patterns and plan storage capacity requirements
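
To illustrate the lifecycle-management point, the sketch below prunes backups whose stored expirationDate (written by storeBackupMetadata() earlier) has passed. It reaches into the manager's internal collections map for brevity, which is a simplification; a production version would expose a dedicated method and handle deletion failures.

const fs = require('fs/promises');

async function pruneExpiredBackups(manager) {
  const now = new Date();

  // Metadata entries whose retention window has elapsed
  const expired = await manager.collections.backupMetadata
    .find({ expirationDate: { $lt: now } })
    .toArray();

  for (const backup of expired) {
    // Remove the backup artifact, then its bookkeeping records
    await fs.rm(backup.backupPath, { recursive: true, force: true });
    await manager.collections.backupMetadata.deleteOne({ backupId: backup.backupId });
    await manager.collections.backupJobs.deleteOne({ backupId: backup.backupId });
  }

  return expired.length;
}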

Conclusion

MongoDB backup and recovery provides comprehensive data protection capabilities that enable robust disaster recovery, regulatory compliance, and business continuity through automated backup scheduling, point-in-time recovery, and advanced verification features. The native backup tools and integrated recovery procedures ensure that critical data is protected with minimal operational overhead.

Key MongoDB Backup and Recovery benefits include:

  • Automated Protection: Intelligent backup scheduling with comprehensive retention policies and automated lifecycle management
  • Advanced Recovery Options: Point-in-time recovery capabilities with oplog integration and incremental backup support
  • Enterprise Reliability: Production-ready backup verification, disaster recovery procedures, and compliance auditing
  • Performance Optimization: Efficient backup compression, parallel processing, and minimal performance impact
  • Operational Excellence: Comprehensive monitoring, alerting, and automated testing for backup system reliability
  • SQL Accessibility: Familiar SQL-style backup management operations through QueryLeaf for accessible data protection

Whether you're protecting mission-critical applications, meeting regulatory compliance requirements, implementing disaster recovery procedures, or managing enterprise backup operations, MongoDB backup and recovery with QueryLeaf's familiar SQL interface provides the foundation for comprehensive, reliable data protection.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB backup and recovery operations while providing SQL-familiar syntax for backup configuration, monitoring, and recovery procedures. Advanced backup strategies, disaster recovery planning, and compliance auditing are seamlessly handled through familiar SQL constructs, making sophisticated data protection accessible to SQL-oriented operations teams.

The combination of MongoDB's robust backup capabilities with SQL-style data protection operations makes it an ideal platform for applications requiring both comprehensive data protection and familiar database management patterns, ensuring your critical data remains secure and recoverable as your systems scale and evolve.

MongoDB Data Pipeline Management and Stream Processing: Advanced Real-Time Data Processing and ETL Pipelines for Modern Applications

Modern data-driven applications require sophisticated data processing pipelines that can handle real-time data ingestion, complex transformations, and reliable data delivery across multiple systems and formats. Traditional batch processing approaches struggle with latency requirements, data volume scalability, and the complexity of managing distributed processing workflows. Effective data pipeline management demands real-time stream processing, incremental data transformations, and intelligent error handling mechanisms.

MongoDB's comprehensive data pipeline capabilities provide advanced stream processing features through Change Streams, Aggregation Framework, and native pipeline orchestration that enable sophisticated real-time data processing workflows. Unlike traditional ETL systems that require separate infrastructure components and complex coordination mechanisms, MongoDB integrates stream processing directly into the database with optimized pipeline execution, automatic scaling, and built-in fault tolerance.
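
Before contrasting this with traditional ETL, here is a minimal Change Streams example for orientation. It assumes a replica set (change streams are not available on standalone servers); the URI, database, and collection names are placeholders.

const { MongoClient } = require('mongodb');

async function watchOrders() {
  const client = new MongoClient('mongodb://localhost:27017/?replicaSet=rs0');
  await client.connect();

  const orders = client.db('shop').collection('orders');

  // fullDocument: 'updateLookup' returns the current document for update events
  const changeStream = orders.watch([], { fullDocument: 'updateLookup' });

  for await (const change of changeStream) {
    // Each event describes a single insert, update, replace, or delete
    console.log(change.operationType, change.documentKey, change.fullDocument);
  }
}

watchOrders().catch(console.error);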

The Traditional Data Pipeline Challenge

Conventional approaches to data pipeline management in relational systems face significant limitations in real-time processing:

-- Traditional PostgreSQL data pipeline management - complex batch processing with limited real-time capabilities

-- Basic ETL tracking table with limited functionality
CREATE TABLE etl_job_runs (
    run_id SERIAL PRIMARY KEY,
    job_name VARCHAR(255) NOT NULL,
    job_type VARCHAR(100) NOT NULL,
    source_system VARCHAR(100),
    target_system VARCHAR(100),

    -- Job execution tracking
    start_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    end_time TIMESTAMP,
    status VARCHAR(50) DEFAULT 'running',

    -- Basic metrics (very limited)
    records_processed INTEGER DEFAULT 0,
    records_inserted INTEGER DEFAULT 0,
    records_updated INTEGER DEFAULT 0,
    records_deleted INTEGER DEFAULT 0,
    records_failed INTEGER DEFAULT 0,

    -- Error tracking (basic)
    error_message TEXT,
    error_count INTEGER DEFAULT 0,

    -- Resource usage (manual tracking)
    cpu_usage_percent DECIMAL(5,2),
    memory_usage_mb INTEGER,
    disk_io_mb INTEGER,

    -- Basic configuration
    batch_size INTEGER DEFAULT 1000,
    parallel_workers INTEGER DEFAULT 1,
    retry_attempts INTEGER DEFAULT 3
);

-- Data transformation rules (static and inflexible)
CREATE TABLE transformation_rules (
    rule_id SERIAL PRIMARY KEY,
    rule_name VARCHAR(255) NOT NULL,
    source_table VARCHAR(255),
    target_table VARCHAR(255),
    transformation_type VARCHAR(100),

    -- Transformation logic (limited SQL expressions)
    source_columns TEXT[],
    target_columns TEXT[],
    transformation_sql TEXT,

    -- Basic validation rules
    validation_rules TEXT[],
    data_quality_checks TEXT[],

    -- Rule metadata
    active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(100)
);

-- Simple batch processing function (no real-time capabilities)
CREATE OR REPLACE FUNCTION execute_batch_etl(
    job_name_param VARCHAR(255),
    batch_size_param INTEGER DEFAULT 1000
) RETURNS TABLE (
    run_id INTEGER,
    records_processed INTEGER,
    execution_time_seconds INTEGER,
    status VARCHAR(50),
    error_message TEXT
) AS $$
DECLARE
    current_run_id INTEGER;
    processing_start TIMESTAMP;
    processing_end TIMESTAMP;
    batch_count INTEGER := 0;
    total_records INTEGER := 0;
    error_msg TEXT := '';
    processing_status VARCHAR(50) := 'completed';
BEGIN
    -- Start new job run
    INSERT INTO etl_job_runs (job_name, job_type, status)
    VALUES (job_name_param, 'batch_etl', 'running')
    RETURNING etl_job_runs.run_id INTO current_run_id;

    processing_start := clock_timestamp();

    BEGIN
        -- Very basic batch processing loop
        LOOP
            -- Simulate batch processing (would be actual data transformation in reality)
            PERFORM pg_sleep(0.1); -- Simulate processing time

            batch_count := batch_count + 1;
            total_records := total_records + batch_size_param;

            -- Simple exit condition (no real data source integration)
            EXIT WHEN batch_count >= 10; -- Process 10 batches maximum

        END LOOP;

    EXCEPTION WHEN OTHERS THEN
        error_msg := SQLERRM;
        processing_status := 'failed';

    END;

    processing_end := clock_timestamp();

    -- Update job run status
    UPDATE etl_job_runs 
    SET 
        end_time = processing_end,
        status = processing_status,
        records_processed = total_records,
        records_inserted = total_records,
        error_message = error_msg,
        error_count = CASE WHEN processing_status = 'failed' THEN 1 ELSE 0 END
    WHERE etl_job_runs.run_id = current_run_id;

    -- Return execution results
    RETURN QUERY SELECT 
        current_run_id,
        total_records,
        EXTRACT(EPOCH FROM (processing_end - processing_start))::INTEGER,
        processing_status,
        error_msg;

END;
$$ LANGUAGE plpgsql;

-- Execute batch ETL job (very basic functionality)
SELECT * FROM execute_batch_etl('customer_data_sync', 500);

-- Data quality monitoring (limited real-time capabilities)
WITH data_quality_metrics AS (
    SELECT 
        ejr.job_name,
        ejr.run_id,
        ejr.start_time,
        ejr.end_time,
        ejr.records_processed,
        ejr.records_failed,

        -- Basic quality calculations
        CASE 
            WHEN ejr.records_processed > 0 THEN 
                ROUND((ejr.records_processed - ejr.records_failed)::DECIMAL / ejr.records_processed * 100, 2)
            ELSE 0
        END as success_rate_percent,

        -- Processing rate
        CASE 
            WHEN EXTRACT(EPOCH FROM (ejr.end_time - ejr.start_time)) > 0 THEN
                ROUND(ejr.records_processed::DECIMAL / EXTRACT(EPOCH FROM (ejr.end_time - ejr.start_time)), 2)
            ELSE 0
        END as records_per_second,

        -- Basic status assessment
        CASE ejr.status
            WHEN 'completed' THEN 'success'
            WHEN 'failed' THEN 'failure'
            ELSE 'unknown'
        END as quality_status

    FROM etl_job_runs ejr
    WHERE ejr.start_time >= CURRENT_DATE - INTERVAL '7 days'
),

quality_summary AS (
    SELECT 
        job_name,
        COUNT(*) as total_runs,
        COUNT(*) FILTER (WHERE quality_status = 'success') as successful_runs,
        COUNT(*) FILTER (WHERE quality_status = 'failure') as failed_runs,

        -- Quality metrics
        AVG(success_rate_percent) as avg_success_rate,
        AVG(records_per_second) as avg_processing_rate,
        SUM(records_processed) as total_records_processed,
        SUM(records_failed) as total_records_failed,

        -- Time-based analysis
        AVG(EXTRACT(EPOCH FROM (end_time - start_time))) as avg_execution_seconds,
        MAX(EXTRACT(EPOCH FROM (end_time - start_time))) as max_execution_seconds,
        MIN(start_time) as first_run,
        MAX(end_time) as last_run

    FROM data_quality_metrics
    GROUP BY job_name
)

SELECT 
    job_name,
    total_runs,
    successful_runs,
    failed_runs,

    -- Success rates
    CASE 
        WHEN total_runs > 0 THEN 
            ROUND((successful_runs::DECIMAL / total_runs) * 100, 1)
        ELSE 0
    END as job_success_rate_percent,

    -- Performance metrics
    ROUND(avg_success_rate, 1) as avg_record_success_rate_percent,
    ROUND(avg_processing_rate, 1) as avg_records_per_second,
    total_records_processed,
    total_records_failed,

    -- Timing analysis
    ROUND(avg_execution_seconds, 1) as avg_duration_seconds,
    ROUND(max_execution_seconds, 1) as max_duration_seconds,

    -- Data quality assessment
    CASE 
        WHEN failed_runs = 0 AND avg_success_rate > 98 THEN 'excellent'
        WHEN failed_runs <= total_runs * 0.05 AND avg_success_rate > 95 THEN 'good'
        WHEN failed_runs <= total_runs * 0.1 AND avg_success_rate > 90 THEN 'acceptable'
        ELSE 'poor'
    END as data_quality_rating,

    -- Recommendations
    CASE 
        WHEN failed_runs > total_runs * 0.1 THEN 'investigate_failures'
        WHEN avg_processing_rate < 100 THEN 'optimize_performance'
        WHEN max_execution_seconds > avg_execution_seconds * 3 THEN 'check_consistency'
        ELSE 'monitor_continued'
    END as recommendation

FROM quality_summary
ORDER BY total_records_processed DESC;

-- Real-time data change tracking (very limited functionality)
CREATE TABLE data_changes (
    change_id SERIAL PRIMARY KEY,
    table_name VARCHAR(255) NOT NULL,
    operation_type VARCHAR(10) NOT NULL, -- INSERT, UPDATE, DELETE
    record_id VARCHAR(100),

    -- Change tracking (basic)
    old_values JSONB,
    new_values JSONB,
    changed_columns TEXT[],

    -- Metadata
    change_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    user_id VARCHAR(100),
    application_name VARCHAR(100),

    -- Processing status
    processed BOOLEAN DEFAULT false,
    processing_attempts INTEGER DEFAULT 0,
    last_processing_attempt TIMESTAMP,
    processing_error TEXT
);

-- Basic trigger function for change tracking
CREATE OR REPLACE FUNCTION track_data_changes()
RETURNS TRIGGER AS $$
BEGIN
    -- Insert change record (very basic functionality)
    INSERT INTO data_changes (
        table_name,
        operation_type,
        record_id,
        old_values,
        new_values,
        user_id
    )
    VALUES (
        TG_TABLE_NAME,
        TG_OP,
        CASE 
            WHEN TG_OP = 'DELETE' THEN OLD.id::TEXT
            ELSE NEW.id::TEXT
        END,
        CASE 
            WHEN TG_OP = 'DELETE' THEN to_jsonb(OLD)
            WHEN TG_OP = 'UPDATE' THEN to_jsonb(OLD)
            ELSE NULL
        END,
        CASE 
            WHEN TG_OP = 'DELETE' THEN NULL
            ELSE to_jsonb(NEW)
        END,
        current_user
    );

    -- Return appropriate record
    CASE TG_OP
        WHEN 'DELETE' THEN RETURN OLD;
        ELSE RETURN NEW;
    END CASE;

EXCEPTION WHEN OTHERS THEN
    -- Basic error handling (logs errors but doesn't stop operations)
    RAISE WARNING 'Change tracking failed: %', SQLERRM;
    CASE TG_OP
        WHEN 'DELETE' THEN RETURN OLD;
        ELSE RETURN NEW;
    END CASE;
END;
$$ LANGUAGE plpgsql;

-- Process pending changes (batch processing only)
WITH pending_changes AS (
    SELECT 
        change_id,
        table_name,
        operation_type,
        new_values,
        old_values,
        change_timestamp,

        -- Group changes by time windows for batch processing
        DATE_TRUNC('minute', change_timestamp) as processing_window

    FROM data_changes
    WHERE processed = false 
    AND processing_attempts < 3
    ORDER BY change_timestamp
    LIMIT 1000
),

change_summary AS (
    SELECT 
        processing_window,
        table_name,
        operation_type,
        COUNT(*) as change_count,
        MIN(change_timestamp) as first_change,
        MAX(change_timestamp) as last_change,

        -- Basic aggregations (very limited analysis)
        COUNT(*) FILTER (WHERE operation_type = 'INSERT') as inserts,
        COUNT(*) FILTER (WHERE operation_type = 'UPDATE') as updates,
        COUNT(*) FILTER (WHERE operation_type = 'DELETE') as deletes

    FROM pending_changes
    GROUP BY processing_window, table_name, operation_type
)

SELECT 
    processing_window,
    table_name,
    operation_type,
    change_count,
    first_change,
    last_change,

    -- Change rate analysis
    CASE 
        WHEN EXTRACT(EPOCH FROM (last_change - first_change)) > 0 THEN
            ROUND(change_count::DECIMAL / EXTRACT(EPOCH FROM (last_change - first_change)), 2)
        ELSE change_count
    END as changes_per_second,

    -- Processing recommendations (very basic)
    CASE 
        WHEN change_count > 1000 THEN 'high_volume_batch'
        WHEN change_count > 100 THEN 'medium_batch'
        ELSE 'small_batch'
    END as processing_strategy,

    -- Simple priority assessment
    CASE table_name
        WHEN 'users' THEN 'high'
        WHEN 'orders' THEN 'high'
        WHEN 'products' THEN 'medium'
        ELSE 'low'
    END as processing_priority

FROM change_summary
ORDER BY processing_window DESC, change_count DESC;

-- Problems with traditional data pipeline approaches:
-- 1. No real-time processing - only batch operations with delays
-- 2. Limited transformation capabilities - basic SQL only
-- 3. Poor scalability - single-threaded processing
-- 4. Manual error handling and recovery
-- 5. No automatic schema evolution or data type handling
-- 6. Limited monitoring and observability
-- 7. Complex integration with external systems
-- 8. No built-in data quality validation
-- 9. Difficult to maintain and debug complex pipelines
-- 10. No support for stream processing or event-driven architectures

MongoDB provides comprehensive data pipeline management with advanced stream processing capabilities:

// MongoDB Advanced Data Pipeline Management and Stream Processing
const { MongoClient, ChangeStream } = require('mongodb');
const { EventEmitter } = require('events');

// Comprehensive MongoDB Data Pipeline Manager
class AdvancedDataPipelineManager extends EventEmitter {
  constructor(mongoUri, pipelineConfig = {}) {
    super();
    this.mongoUri = mongoUri;
    this.client = null;
    this.db = null;

    // Advanced pipeline configuration
    this.config = {
      // Processing configuration
      enableRealTimeProcessing: pipelineConfig.enableRealTimeProcessing !== false,
      enableBatchProcessing: pipelineConfig.enableBatchProcessing !== false,
      enableStreamProcessing: pipelineConfig.enableStreamProcessing !== false,

      // Performance settings
      maxConcurrentPipelines: pipelineConfig.maxConcurrentPipelines || 10,
      batchSize: pipelineConfig.batchSize || 1000,
      maxRetries: pipelineConfig.maxRetries || 3,
      retryDelay: pipelineConfig.retryDelay || 1000,

      // Change stream configuration
      enableChangeStreams: pipelineConfig.enableChangeStreams !== false,
      changeStreamOptions: pipelineConfig.changeStreamOptions || {
        fullDocument: 'updateLookup',
        fullDocumentBeforeChange: 'whenAvailable'
      },

      // Data quality and validation
      enableDataValidation: pipelineConfig.enableDataValidation !== false,
      enableSchemaEvolution: pipelineConfig.enableSchemaEvolution || false,
      enableDataLineage: pipelineConfig.enableDataLineage || false,

      // Monitoring and observability
      enableMetrics: pipelineConfig.enableMetrics !== false,
      enablePipelineMonitoring: pipelineConfig.enablePipelineMonitoring !== false,
      enableErrorTracking: pipelineConfig.enableErrorTracking !== false,

      // Advanced features
      enableIncrementalProcessing: pipelineConfig.enableIncrementalProcessing || false,
      enableDataDeduplication: pipelineConfig.enableDataDeduplication || false,
      enableDataEnrichment: pipelineConfig.enableDataEnrichment || false
    };

    // Pipeline registry and state management
    this.pipelines = new Map();
    this.changeStreams = new Map();
    this.pipelineMetrics = new Map();
    this.activeProcessing = new Map();

    // Error tracking and recovery
    this.errorHistory = [];
    this.retryQueues = new Map();

    this.initializeDataPipelines();
  }

  async initializeDataPipelines() {
    console.log('Initializing advanced data pipeline management system...');

    try {
      // Connect to MongoDB
      this.client = new MongoClient(this.mongoUri);
      await this.client.connect();
      this.db = this.client.db();

      // Setup pipeline infrastructure
      await this.setupPipelineInfrastructure();

      // Initialize change streams if enabled
      if (this.config.enableChangeStreams) {
        await this.setupChangeStreams();
      }

      // Start pipeline monitoring
      if (this.config.enablePipelineMonitoring) {
        await this.startPipelineMonitoring();
      }

      console.log('Advanced data pipeline system initialized successfully');

    } catch (error) {
      console.error('Error initializing data pipeline system:', error);
      throw error;
    }
  }

  async setupPipelineInfrastructure() {
    console.log('Setting up pipeline infrastructure...');

    try {
      // Create collections for pipeline management
      const collections = {
        pipelineDefinitions: this.db.collection('pipeline_definitions'),
        pipelineRuns: this.db.collection('pipeline_runs'),
        pipelineMetrics: this.db.collection('pipeline_metrics'),
        dataLineage: this.db.collection('data_lineage'),
        pipelineErrors: this.db.collection('pipeline_errors'),
        transformationRules: this.db.collection('transformation_rules')
      };

      // Create indexes for optimal performance
      await collections.pipelineRuns.createIndex(
        { pipelineId: 1, startTime: -1 },
        { background: true }
      );

      await collections.pipelineMetrics.createIndex(
        { pipelineId: 1, timestamp: -1 },
        { background: true }
      );

      await collections.dataLineage.createIndex(
        { sourceCollection: 1, targetCollection: 1, timestamp: -1 },
        { background: true }
      );

      this.collections = collections;

    } catch (error) {
      console.error('Error setting up pipeline infrastructure:', error);
      throw error;
    }
  }

  async registerDataPipeline(pipelineDefinition) {
    console.log(`Registering data pipeline: ${pipelineDefinition.name}`);

    try {
      // Validate pipeline definition
      const validatedDefinition = await this.validatePipelineDefinition(pipelineDefinition);

      // Enhanced pipeline definition with metadata
      const enhancedDefinition = {
        ...validatedDefinition,
        pipelineId: this.generatePipelineId(validatedDefinition.name),

        // Pipeline metadata
        registeredAt: new Date(),
        version: pipelineDefinition.version || '1.0.0',
        status: 'registered',

        // Processing configuration
        processingMode: pipelineDefinition.processingMode || 'stream', // stream, batch, hybrid
        triggerType: pipelineDefinition.triggerType || 'change_stream', // change_stream, schedule, manual

        // Data transformation pipeline
        transformationStages: pipelineDefinition.transformationStages || [],

        // Data sources and targets
        dataSources: pipelineDefinition.dataSources || [],
        dataTargets: pipelineDefinition.dataTargets || [],

        // Quality and validation rules
        dataQualityRules: pipelineDefinition.dataQualityRules || [],
        schemaValidationRules: pipelineDefinition.schemaValidationRules || [],

        // Performance configuration
        performance: {
          batchSize: pipelineDefinition.batchSize || this.config.batchSize,
          maxConcurrency: pipelineDefinition.maxConcurrency || 5,
          timeoutMs: pipelineDefinition.timeoutMs || 300000,

          // Resource limits
          maxMemoryMB: pipelineDefinition.maxMemoryMB || 1024,
          maxCpuPercent: pipelineDefinition.maxCpuPercent || 80
        },

        // Error handling configuration
        errorHandling: {
          retryStrategy: pipelineDefinition.retryStrategy || 'exponential_backoff',
          maxRetries: pipelineDefinition.maxRetries || this.config.maxRetries,
          deadLetterQueue: pipelineDefinition.deadLetterQueue !== false,
          errorNotifications: pipelineDefinition.errorNotifications || []
        },

        // Monitoring configuration
        monitoring: {
          enableMetrics: pipelineDefinition.enableMetrics !== false,
          metricsInterval: pipelineDefinition.metricsInterval || 60000,
          alertThresholds: pipelineDefinition.alertThresholds || {}
        }
      };

      // Store pipeline definition
      await this.collections.pipelineDefinitions.replaceOne(
        { pipelineId: enhancedDefinition.pipelineId },
        enhancedDefinition,
        { upsert: true }
      );

      // Register pipeline in memory
      this.pipelines.set(enhancedDefinition.pipelineId, {
        definition: enhancedDefinition,
        status: 'registered',
        lastRun: null,
        statistics: {
          totalRuns: 0,
          successfulRuns: 0,
          failedRuns: 0,
          totalRecordsProcessed: 0,
          averageProcessingTime: 0
        }
      });

      console.log(`Pipeline '${enhancedDefinition.name}' registered successfully with ID: ${enhancedDefinition.pipelineId}`);

      // Start pipeline if configured for automatic startup
      if (enhancedDefinition.autoStart) {
        await this.startPipeline(enhancedDefinition.pipelineId);
      }

      return {
        success: true,
        pipelineId: enhancedDefinition.pipelineId,
        definition: enhancedDefinition
      };

    } catch (error) {
      console.error(`Error registering pipeline '${pipelineDefinition.name}':`, error);
      return {
        success: false,
        error: error.message,
        pipelineDefinition: pipelineDefinition
      };
    }
  }

  async startPipeline(pipelineId) {
    console.log(`Starting data pipeline: ${pipelineId}`);

    try {
      const pipeline = this.pipelines.get(pipelineId);
      if (!pipeline) {
        throw new Error(`Pipeline not found: ${pipelineId}`);
      }

      if (pipeline.status === 'running') {
        console.log(`Pipeline ${pipelineId} is already running`);
        return { success: true, status: 'already_running' };
      }

      const definition = pipeline.definition;

      // Create pipeline run record
      const runRecord = {
        runId: this.generateRunId(),
        pipelineId: pipelineId,
        pipelineName: definition.name,
        startTime: new Date(),
        status: 'running',

        // Processing metrics
        recordsProcessed: 0,
        recordsSuccessful: 0,
        recordsFailed: 0,

        // Performance tracking
        processingTimeMs: 0,
        throughputRecordsPerSecond: 0,

        // Resource usage
        memoryUsageMB: 0,
        cpuUsagePercent: 0,

        // Error tracking
        errors: [],
        retryAttempts: 0
      };

      await this.collections.pipelineRuns.insertOne(runRecord);

      // Start processing based on trigger type
      switch (definition.triggerType) {
        case 'change_stream':
          await this.startChangeStreamPipeline(pipelineId, definition, runRecord);
          break;

        case 'schedule':
          await this.startScheduledPipeline(pipelineId, definition, runRecord);
          break;

        case 'batch':
          await this.startBatchPipeline(pipelineId, definition, runRecord);
          break;

        default:
          throw new Error(`Unsupported trigger type: ${definition.triggerType}`);
      }

      // Update pipeline status
      pipeline.status = 'running';
      pipeline.lastRun = runRecord;

      this.emit('pipelineStarted', {
        pipelineId: pipelineId,
        runId: runRecord.runId,
        startTime: runRecord.startTime
      });

      return {
        success: true,
        pipelineId: pipelineId,
        runId: runRecord.runId,
        status: 'running'
      };

    } catch (error) {
      console.error(`Error starting pipeline ${pipelineId}:`, error);

      // Update pipeline status to error
      const pipeline = this.pipelines.get(pipelineId);
      if (pipeline) {
        pipeline.status = 'error';
      }

      return {
        success: false,
        pipelineId: pipelineId,
        error: error.message
      };
    }
  }

  async startChangeStreamPipeline(pipelineId, definition, runRecord) {
    console.log(`Starting change stream pipeline: ${pipelineId}`);

    try {
      const dataSources = definition.dataSources;

      for (const dataSource of dataSources) {
        const collection = this.db.collection(dataSource.collection);

        // Configure change stream options
        const changeStreamOptions = {
          ...this.config.changeStreamOptions,
          ...dataSource.changeStreamOptions
        };

        // Pipeline-specific filters are aggregation stages passed to watch(),
        // not change stream options
        const watchPipeline = dataSource.filter
          ? [{ $match: dataSource.filter }]
          : [];

        // Create change stream
        const changeStream = collection.watch(watchPipeline, changeStreamOptions);

        // Store change stream reference
        this.changeStreams.set(`${pipelineId}_${dataSource.collection}`, changeStream);

        // Setup change stream event handlers
        changeStream.on('change', async (changeEvent) => {
          await this.processChangeEvent(pipelineId, definition, runRecord, changeEvent);
        });

        changeStream.on('error', async (error) => {
          console.error(`Change stream error for pipeline ${pipelineId}:`, error);
          await this.handlePipelineError(pipelineId, runRecord, error);
        });

        changeStream.on('close', () => {
          console.log(`Change stream closed for pipeline ${pipelineId}`);
          this.emit('pipelineStreamClosed', { pipelineId, collection: dataSource.collection });
        });
      }

    } catch (error) {
      console.error(`Error starting change stream pipeline ${pipelineId}:`, error);
      throw error;
    }
  }

  async processChangeEvent(pipelineId, definition, runRecord, changeEvent) {
    try {
      // Track processing start
      const processingStart = Date.now();

      // Apply transformation stages
      let processedData = changeEvent;

      for (const transformationStage of definition.transformationStages) {
        processedData = await this.applyTransformation(
          processedData, 
          transformationStage, 
          definition
        );
      }

      // Apply data quality validation
      if (this.config.enableDataValidation) {
        const validationResult = await this.validateData(
          processedData, 
          definition.dataQualityRules
        );

        if (!validationResult.isValid) {
          await this.handleValidationError(pipelineId, runRecord, processedData, validationResult);
          return;
        }
      }

      // Write to target destinations
      const writeResults = await this.writeToTargets(
        processedData, 
        definition.dataTargets, 
        definition
      );

      // Update run metrics
      const processingTime = Date.now() - processingStart;

      await this.updateRunMetrics(runRecord, {
        recordsProcessed: 1,
        recordsSuccessful: writeResults.successCount,
        recordsFailed: writeResults.failureCount,
        processingTimeMs: processingTime
      });

      // Record data lineage if enabled
      if (this.config.enableDataLineage) {
        await this.recordDataLineage(pipelineId, changeEvent, processedData, definition);
      }

      this.emit('recordProcessed', {
        pipelineId: pipelineId,
        runId: runRecord.runId,
        changeEvent: changeEvent,
        processedData: processedData,
        processingTime: processingTime
      });

    } catch (error) {
      console.error(`Error processing change event for pipeline ${pipelineId}:`, error);
      await this.handleProcessingError(pipelineId, runRecord, changeEvent, error);
    }
  }

  async applyTransformation(data, transformationStage, pipelineDefinition) {
    console.log(`Applying transformation: ${transformationStage.type}`);

    try {
      switch (transformationStage.type) {
        case 'aggregation':
          return await this.applyAggregationTransformation(data, transformationStage);

        case 'field_mapping':
          return await this.applyFieldMapping(data, transformationStage);

        case 'data_enrichment':
          return await this.applyDataEnrichment(data, transformationStage, pipelineDefinition);

        case 'filtering':
          return await this.applyFiltering(data, transformationStage);

        case 'normalization':
          return await this.applyNormalization(data, transformationStage);

        case 'custom_function':
          return await this.applyCustomFunction(data, transformationStage);

        default:
          console.warn(`Unknown transformation type: ${transformationStage.type}`);
          return data;
      }

    } catch (error) {
      console.error(`Error applying transformation ${transformationStage.type}:`, error);
      throw error;
    }
  }

  async applyAggregationTransformation(data, transformationStage) {
    // Apply MongoDB aggregation pipeline to transform data
    const pipeline = transformationStage.aggregationPipeline;

    if (!Array.isArray(pipeline) || pipeline.length === 0) {
      return data;
    }

    try {
      // Execute aggregation on source data
      // This would work with the actual data structure in a real implementation
      let transformedData = data;

      // Simulate aggregation operations
      for (const stage of pipeline) {
        if (stage.$project) {
          transformedData = this.projectFields(transformedData, stage.$project);
        } else if (stage.$match) {
          transformedData = this.matchFilter(transformedData, stage.$match);
        } else if (stage.$addFields) {
          transformedData = this.addFields(transformedData, stage.$addFields);
        }
        // Add more aggregation operators as needed
      }

      return transformedData;

    } catch (error) {
      console.error('Error in aggregation transformation:', error);
      throw error;
    }
  }

  async applyFieldMapping(data, transformationStage) {
    // Apply field mapping transformation
    const mappings = transformationStage.fieldMappings;

    if (!mappings || Object.keys(mappings).length === 0) {
      return data;
    }

    try {
      let mappedData = { ...data };

      // Apply field mappings
      Object.entries(mappings).forEach(([targetField, sourceField]) => {
        const sourceValue = this.getNestedValue(data, sourceField);
        this.setNestedValue(mappedData, targetField, sourceValue);
      });

      return mappedData;

    } catch (error) {
      console.error('Error in field mapping transformation:', error);
      throw error;
    }
  }

  async applyDataEnrichment(data, transformationStage, pipelineDefinition) {
    // Apply data enrichment from external sources
    const enrichmentConfig = transformationStage.enrichmentConfig;

    try {
      let enrichedData = { ...data };

      for (const enrichment of enrichmentConfig.enrichments) {
        switch (enrichment.type) {
          case 'lookup':
            enrichedData = await this.applyLookupEnrichment(enrichedData, enrichment);
            break;

          case 'calculation':
            enrichedData = await this.applyCalculationEnrichment(enrichedData, enrichment);
            break;

          case 'external_api':
            enrichedData = await this.applyExternalApiEnrichment(enrichedData, enrichment);
            break;
        }
      }

      return enrichedData;

    } catch (error) {
      console.error('Error in data enrichment transformation:', error);
      throw error;
    }
  }

  async writeToTargets(processedData, dataTargets, pipelineDefinition) {
    console.log('Writing processed data to targets...');

    const writeResults = {
      successCount: 0,
      failureCount: 0,
      results: []
    };

    try {
      const writePromises = dataTargets.map(async (target) => {
        try {
          const result = await this.writeToTarget(processedData, target, pipelineDefinition);
          writeResults.successCount++;
          writeResults.results.push({ target: target.name, success: true, result });
          return result;

        } catch (error) {
          console.error(`Error writing to target ${target.name}:`, error);
          writeResults.failureCount++;
          writeResults.results.push({ 
            target: target.name, 
            success: false, 
            error: error.message 
          });
          throw error;
        }
      });

      await Promise.allSettled(writePromises);

      return writeResults;

    } catch (error) {
      console.error('Error writing to targets:', error);
      throw error;
    }
  }

  async writeToTarget(processedData, target, pipelineDefinition) {
    console.log(`Writing to target: ${target.name} (${target.type})`);

    try {
      switch (target.type) {
        case 'mongodb_collection':
          return await this.writeToMongoDBCollection(processedData, target);

        case 'file':
          return await this.writeToFile(processedData, target);

        case 'external_api':
          return await this.writeToExternalAPI(processedData, target);

        case 'message_queue':
          return await this.writeToMessageQueue(processedData, target);

        default:
          throw new Error(`Unsupported target type: ${target.type}`);
      }

    } catch (error) {
      console.error(`Error writing to target ${target.name}:`, error);
      throw error;
    }
  }

  async writeToMongoDBCollection(processedData, target) {
    const collection = this.db.collection(target.collection);

    try {
      switch (target.writeMode || 'insert') {
        case 'insert':
          const insertResult = await collection.insertOne(processedData);
          return { operation: 'insert', insertedId: insertResult.insertedId };

        case 'upsert':
          const upsertResult = await collection.replaceOne(
            target.upsertFilter || { _id: processedData._id },
            processedData,
            { upsert: true }
          );
          return { 
            operation: 'upsert', 
            modifiedCount: upsertResult.modifiedCount,
            upsertedId: upsertResult.upsertedId
          };

        case 'update':
          const updateResult = await collection.updateOne(
            target.updateFilter || { _id: processedData._id },
            { $set: processedData }
          );
          return {
            operation: 'update',
            matchedCount: updateResult.matchedCount,
            modifiedCount: updateResult.modifiedCount
          };

        default:
          throw new Error(`Unsupported write mode: ${target.writeMode}`);
      }

    } catch (error) {
      console.error('Error writing to MongoDB collection:', error);
      throw error;
    }
  }

  async getPipelineMetrics(pipelineId, timeRange = {}) {
    console.log(`Getting metrics for pipeline: ${pipelineId}`);

    try {
      const pipeline = this.pipelines.get(pipelineId);
      if (!pipeline) {
        throw new Error(`Pipeline not found: ${pipelineId}`);
      }

      // Build time range filter
      const timeFilter = {};
      if (timeRange.startTime) {
        timeFilter.$gte = new Date(timeRange.startTime);
      }
      if (timeRange.endTime) {
        timeFilter.$lte = new Date(timeRange.endTime);
      }

      const matchStage = { pipelineId: pipelineId };
      if (Object.keys(timeFilter).length > 0) {
        matchStage.startTime = timeFilter;
      }

      // Aggregate pipeline metrics
      const metricsAggregation = [
        { $match: matchStage },
        {
          $group: {
            _id: '$pipelineId',
            totalRuns: { $sum: 1 },
            successfulRuns: { 
              $sum: { $cond: [{ $eq: ['$status', 'completed'] }, 1, 0] } 
            },
            failedRuns: { 
              $sum: { $cond: [{ $eq: ['$status', 'failed'] }, 1, 0] } 
            },
            totalRecordsProcessed: { $sum: '$recordsProcessed' },
            totalRecordsSuccessful: { $sum: '$recordsSuccessful' },
            totalRecordsFailed: { $sum: '$recordsFailed' },

            // Performance metrics
            averageProcessingTime: { $avg: '$processingTimeMs' },
            maxProcessingTime: { $max: '$processingTimeMs' },
            minProcessingTime: { $min: '$processingTimeMs' },

            // Throughput metrics
            averageThroughput: { $avg: '$throughputRecordsPerSecond' },
            maxThroughput: { $max: '$throughputRecordsPerSecond' },

            // Resource usage
            averageMemoryUsage: { $avg: '$memoryUsageMB' },
            maxMemoryUsage: { $max: '$memoryUsageMB' },
            averageCpuUsage: { $avg: '$cpuUsagePercent' },
            maxCpuUsage: { $max: '$cpuUsagePercent' },

            // Time range
            firstRun: { $min: '$startTime' },
            lastRun: { $max: '$startTime' }
          }
        }
      ];

      const metricsResult = await this.collections.pipelineRuns
        .aggregate(metricsAggregation)
        .toArray();

      const metrics = metricsResult[0] || {
        _id: pipelineId,
        totalRuns: 0,
        successfulRuns: 0,
        failedRuns: 0,
        totalRecordsProcessed: 0,
        totalRecordsSuccessful: 0,
        totalRecordsFailed: 0,
        averageProcessingTime: 0,
        averageThroughput: 0,
        averageMemoryUsage: 0,
        averageCpuUsage: 0
      };

      // Calculate additional derived metrics
      const successRate = metrics.totalRuns > 0 ? 
        (metrics.successfulRuns / metrics.totalRuns) * 100 : 0;

      const dataQualityRate = metrics.totalRecordsProcessed > 0 ? 
        (metrics.totalRecordsSuccessful / metrics.totalRecordsProcessed) * 100 : 0;

      return {
        success: true,
        pipelineId: pipelineId,
        timeRange: timeRange,

        // Basic metrics
        totalRuns: metrics.totalRuns,
        successfulRuns: metrics.successfulRuns,
        failedRuns: metrics.failedRuns,
        successRate: Math.round(successRate * 100) / 100,

        // Data processing metrics
        totalRecordsProcessed: metrics.totalRecordsProcessed,
        totalRecordsSuccessful: metrics.totalRecordsSuccessful,
        totalRecordsFailed: metrics.totalRecordsFailed,
        dataQualityRate: Math.round(dataQualityRate * 100) / 100,

        // Performance metrics
        performance: {
          averageProcessingTimeMs: Math.round(metrics.averageProcessingTime || 0),
          maxProcessingTimeMs: metrics.maxProcessingTime || 0,
          minProcessingTimeMs: metrics.minProcessingTime || 0,
          averageThroughputRps: Math.round((metrics.averageThroughput || 0) * 100) / 100,
          maxThroughputRps: Math.round((metrics.maxThroughput || 0) * 100) / 100
        },

        // Resource usage
        resourceUsage: {
          averageMemoryMB: Math.round(metrics.averageMemoryUsage || 0),
          maxMemoryMB: metrics.maxMemoryUsage || 0,
          averageCpuPercent: Math.round((metrics.averageCpuUsage || 0) * 100) / 100,
          maxCpuPercent: Math.round((metrics.maxCpuUsage || 0) * 100) / 100
        },

        // Time range
        timeSpan: {
          firstRun: metrics.firstRun,
          lastRun: metrics.lastRun,
          duration: metrics.firstRun && metrics.lastRun ? 
            metrics.lastRun.getTime() - metrics.firstRun.getTime() : 0
        },

        // Pipeline status
        currentStatus: pipeline.status,
        lastRunStatus: pipeline.lastRun ? pipeline.lastRun.status : null
      };

    } catch (error) {
      console.error(`Error getting pipeline metrics for ${pipelineId}:`, error);
      return {
        success: false,
        pipelineId: pipelineId,
        error: error.message
      };
    }
  }

  async stopPipeline(pipelineId) {
    console.log(`Stopping pipeline: ${pipelineId}`);

    try {
      const pipeline = this.pipelines.get(pipelineId);
      if (!pipeline) {
        throw new Error(`Pipeline not found: ${pipelineId}`);
      }

      // Stop change streams
      for (const [streamKey, changeStream] of this.changeStreams.entries()) {
        if (streamKey.startsWith(pipelineId)) {
          await changeStream.close();
          this.changeStreams.delete(streamKey);
        }
      }

      // Update pipeline status
      pipeline.status = 'stopped';

      // Update current run if exists
      if (pipeline.lastRun && pipeline.lastRun.status === 'running') {
        await this.collections.pipelineRuns.updateOne(
          { runId: pipeline.lastRun.runId },
          {
            $set: {
              status: 'stopped',
              endTime: new Date(),
              processingTimeMs: Date.now() - pipeline.lastRun.startTime.getTime()
            }
          }
        );
      }

      this.emit('pipelineStopped', {
        pipelineId: pipelineId,
        stopTime: new Date()
      });

      return {
        success: true,
        pipelineId: pipelineId,
        status: 'stopped'
      };

    } catch (error) {
      console.error(`Error stopping pipeline ${pipelineId}:`, error);
      return {
        success: false,
        pipelineId: pipelineId,
        error: error.message
      };
    }
  }

  // Utility methods for data processing

  getNestedValue(obj, path) {
    return path.split('.').reduce((current, key) => current && current[key], obj);
  }

  setNestedValue(obj, path, value) {
    const keys = path.split('.');
    const lastKey = keys.pop();
    const target = keys.reduce((current, key) => {
      if (!current[key]) current[key] = {};
      return current[key];
    }, obj);
    target[lastKey] = value;
  }

  projectFields(data, projection) {
    const result = {};
    Object.entries(projection).forEach(([field, include]) => {
      if (include) {
        const value = this.getNestedValue(data, field);
        if (value !== undefined) {
          this.setNestedValue(result, field, value);
        }
      }
    });
    return result;
  }

  matchFilter(data, filter) {
    // Simplified match implementation
    // In production, would implement full MongoDB query matching
    for (const [field, condition] of Object.entries(filter)) {
      const value = this.getNestedValue(data, field);

      if (typeof condition === 'object' && condition !== null) {
        // Handle operators like $eq, $ne, $gt, etc.
        for (const [operator, operand] of Object.entries(condition)) {
          switch (operator) {
            case '$eq':
              if (value !== operand) return null;
              break;
            case '$ne':
              if (value === operand) return null;
              break;
            case '$gt':
              if (value <= operand) return null;
              break;
            case '$gte':
              if (value < operand) return null;
              break;
            case '$lt':
              if (value >= operand) return null;
              break;
            case '$lte':
              if (value > operand) return null;
              break;
            case '$in':
              if (!operand.includes(value)) return null;
              break;
            case '$nin':
              if (operand.includes(value)) return null;
              break;
          }
        }
      } else {
        // Direct value comparison
        if (value !== condition) return null;
      }
    }

    return data;
  }

  addFields(data, fieldsToAdd) {
    const result = { ...data };

    Object.entries(fieldsToAdd).forEach(([field, expression]) => {
      // Simplified field addition
      // In production, would implement full MongoDB expression evaluation
      if (typeof expression === 'string' && expression.startsWith('$')) {
        // Reference to another field
        const referencedValue = this.getNestedValue(data, expression.slice(1));
        this.setNestedValue(result, field, referencedValue);
      } else {
        // Literal value
        this.setNestedValue(result, field, expression);
      }
    });

    return result;
  }

  generatePipelineId(name) {
    return `pipeline_${name.toLowerCase().replace(/\s+/g, '_')}_${Date.now()}`;
  }

  generateRunId() {
    return `run_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
  }

  async validatePipelineDefinition(definition) {
    // Validate required fields
    if (!definition.name) {
      throw new Error('Pipeline name is required');
    }

    if (!definition.dataSources || definition.dataSources.length === 0) {
      throw new Error('At least one data source is required');
    }

    if (!definition.dataTargets || definition.dataTargets.length === 0) {
      throw new Error('At least one data target is required');
    }

    // Add more validation as needed
    return definition;
  }

  async updateRunMetrics(runRecord, metrics) {
    try {
      const updateData = {};

      if (metrics.recordsProcessed) {
        updateData.$inc = { recordsProcessed: metrics.recordsProcessed };
      }

      if (metrics.recordsSuccessful) {
        updateData.$inc = { ...updateData.$inc, recordsSuccessful: metrics.recordsSuccessful };
      }

      if (metrics.recordsFailed) {
        updateData.$inc = { ...updateData.$inc, recordsFailed: metrics.recordsFailed };
      }

      if (metrics.processingTimeMs) {
        updateData.$set = { 
          lastProcessingTime: metrics.processingTimeMs,
          lastUpdateTime: new Date()
        };
      }

      if (Object.keys(updateData).length > 0) {
        await this.collections.pipelineRuns.updateOne(
          { runId: runRecord.runId },
          updateData
        );
      }

    } catch (error) {
      console.error('Error updating run metrics:', error);
    }
  }

  async handlePipelineError(pipelineId, runRecord, error) {
    console.error(`Pipeline error for ${pipelineId}:`, error);

    try {
      // Record error
      const errorRecord = {
        pipelineId: pipelineId,
        runId: runRecord.runId,
        errorTime: new Date(),
        errorType: error.constructor.name,
        errorMessage: error.message,
        errorStack: error.stack,

        // Context information
        processingContext: {
          recordsProcessedBeforeError: runRecord.recordsProcessed,
          runDuration: Date.now() - runRecord.startTime.getTime()
        }
      };

      await this.collections.pipelineErrors.insertOne(errorRecord);

      // Update run status
      await this.collections.pipelineRuns.updateOne(
        { runId: runRecord.runId },
        {
          $set: {
            status: 'failed',
            endTime: new Date(),
            errorMessage: error.message
          },
          $push: { errors: errorRecord }
        }
      );

      // Update pipeline status
      const pipeline = this.pipelines.get(pipelineId);
      if (pipeline) {
        pipeline.status = 'error';
        pipeline.statistics.failedRuns++;
      }

      this.emit('pipelineError', {
        pipelineId: pipelineId,
        runId: runRecord.runId,
        error: errorRecord
      });

    } catch (recordingError) {
      console.error('Error recording pipeline error:', recordingError);
    }
  }

  async validateData(data, qualityRules) {
    // Implement data quality validation logic
    const validationResult = {
      isValid: true,
      errors: [],
      warnings: []
    };

    // Apply quality rules
    for (const rule of qualityRules) {
      try {
        const ruleResult = await this.applyQualityRule(data, rule);
        if (!ruleResult.passed) {
          validationResult.isValid = false;
          validationResult.errors.push({
            rule: rule.name,
            message: ruleResult.message,
            field: rule.field,
            value: this.getNestedValue(data, rule.field)
          });
        }
      } catch (error) {
        validationResult.warnings.push({
          rule: rule.name,
          message: `Rule validation failed: ${error.message}`
        });
      }
    }

    return validationResult;
  }

  async applyQualityRule(data, rule) {
    // Implement specific quality rule logic
    switch (rule.type) {
      case 'required':
        const value = this.getNestedValue(data, rule.field);
        const isPresent = value !== null && value !== undefined && value !== '';
        return {
          passed: isPresent,
          message: isPresent ? 'Field is present' : `Required field '${rule.field}' is missing`
        };

      case 'type':
        const fieldValue = this.getNestedValue(data, rule.field);
        const actualType = typeof fieldValue;
        return {
          passed: actualType === rule.expectedType,
          message: actualType === rule.expectedType ? 
            'Type validation passed' : 
            `Expected type '${rule.expectedType}' but got '${actualType}'`
        };

      case 'range':
        const numericValue = this.getNestedValue(data, rule.field);
        const inRange = numericValue >= rule.min && numericValue <= rule.max;
        return {
          passed: inRange,
          message: inRange ? 
            'Value is within range' : 
            `Value ${numericValue} is outside range [${rule.min}, ${rule.max}]`
        };

      default:
        return { passed: true, message: 'Unknown rule type' };
    }
  }

  async recordDataLineage(pipelineId, originalData, processedData, definition) {
    try {
      const lineageRecord = {
        pipelineId: pipelineId,
        timestamp: new Date(),

        // Data sources
        dataSources: definition.dataSources.map(source => ({
          collection: source.collection,
          database: source.database || this.db.databaseName
        })),

        // Data targets
        dataTargets: definition.dataTargets.map(target => ({
          collection: target.collection,
          database: target.database || this.db.databaseName,
          type: target.type
        })),

        // Transformation metadata
        transformations: definition.transformationStages.map(stage => ({
          type: stage.type,
          applied: true
        })),

        // Data checksums for integrity verification
        originalDataChecksum: this.calculateChecksum(originalData),
        processedDataChecksum: this.calculateChecksum(processedData),

        // Record identifiers
        originalRecordId: originalData._id || originalData.id,
        processedRecordId: processedData._id || processedData.id
      };

      await this.collections.dataLineage.insertOne(lineageRecord);

    } catch (error) {
      console.error('Error recording data lineage:', error);
      // Don't throw - lineage recording shouldn't stop pipeline execution
    }
  }

  calculateChecksum(data) {
    // Simple checksum calculation for demonstration
    // In production, would use proper hashing algorithm
    const dataString = JSON.stringify(data, Object.keys(data).sort());
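    // e.g. Node's crypto module could be used instead (assumption, not wired in above):
    // require('crypto').createHash('sha256').update(dataString).digest('hex')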
    let hash = 0;
    for (let i = 0; i < dataString.length; i++) {
      const char = dataString.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash; // Convert to 32bit integer
    }
    return hash.toString(36);
  }

  async shutdown() {
    console.log('Shutting down data pipeline manager...');

    try {
      // Stop all running pipelines
      for (const [pipelineId, pipeline] of this.pipelines.entries()) {
        if (pipeline.status === 'running') {
          await this.stopPipeline(pipelineId);
        }
      }

      // Close all change streams
      for (const [streamKey, changeStream] of this.changeStreams.entries()) {
        await changeStream.close();
      }

      // Close MongoDB connection
      if (this.client) {
        await this.client.close();
      }

      console.log('Data pipeline manager shutdown complete');

    } catch (error) {
      console.error('Error during shutdown:', error);
    }
  }

  // Additional methods would include implementations for:
  // - setupChangeStreams()
  // - startPipelineMonitoring()
  // - startScheduledPipeline()
  // - startBatchPipeline()
  // - applyLookupEnrichment()
  // - applyCalculationEnrichment()
  // - applyExternalApiEnrichment()
  // - applyFiltering()
  // - applyNormalization()
  // - applyCustomFunction()
  // - writeToFile()
  // - writeToExternalAPI()
  // - writeToMessageQueue()
  // - handleValidationError()
  // - handleProcessingError()
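
  // Hedged sketch of one helper from the list above: a minimal
  // applyNormalization() that trims and lowercases configured string fields.
  // The transformationStage.normalizeFields shape is an assumption.
  async applyNormalization(data, transformationStage) {
    const normalized = { ...data };
    for (const field of transformationStage.normalizeFields || []) {
      const value = this.getNestedValue(normalized, field);
      if (typeof value === 'string') {
        this.setNestedValue(normalized, field, value.trim().toLowerCase());
      }
    }
    return normalized;
  }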
}

// Benefits of MongoDB Advanced Data Pipeline Management:
// - Real-time stream processing with Change Streams
// - Sophisticated data transformation and enrichment capabilities  
// - Comprehensive error handling and recovery mechanisms
// - Built-in data quality validation and monitoring
// - Automatic scalability and performance optimization
// - Data lineage tracking and audit capabilities
// - Flexible pipeline orchestration and scheduling
// - SQL-compatible operations through QueryLeaf integration
// - Production-ready monitoring and observability features
// - Enterprise-grade reliability and fault tolerance

module.exports = {
  AdvancedDataPipelineManager
};
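
As a hedged usage sketch, the manager above could be wired up along the following lines. The connection string, module path, collection names, and field mappings are illustrative assumptions rather than values defined earlier in this post:

// Minimal usage sketch; URI, module path, collections, and mappings are assumptions.
const { AdvancedDataPipelineManager } = require('./data-pipeline-manager');

async function main() {
  const manager = new AdvancedDataPipelineManager('mongodb://localhost:27017/ecommerce', {
    enableChangeStreams: true,
    enableDataValidation: true
  });

  // Wait for the async initialization started in the constructor
  await manager.ready;

  // Register a simple change-stream pipeline: orders -> order_summaries
  const registration = await manager.registerDataPipeline({
    name: 'order summary sync',
    triggerType: 'change_stream',
    dataSources: [{ name: 'orders', collection: 'orders' }],
    transformationStages: [{
      type: 'field_mapping',
      fieldMappings: {
        orderId: 'fullDocument._id',
        customerId: 'fullDocument.customer_id',
        total: 'fullDocument.total_amount'
      }
    }],
    dataQualityRules: [{ name: 'order id required', type: 'required', field: 'orderId' }],
    dataTargets: [{
      name: 'order_summaries',
      type: 'mongodb_collection',
      collection: 'order_summaries',
      writeMode: 'upsert'
    }]
  });

  if (registration.success) {
    await manager.startPipeline(registration.pipelineId);
  }

  manager.on('recordProcessed', (event) => {
    console.log(`Pipeline ${event.pipelineId} processed a change in ${event.processingTime}ms`);
  });
}

main().catch(console.error);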

Advanced Stream Processing Patterns

Real-Time Data Transformation and Analytics

Implement sophisticated stream processing for real-time data analytics:

// Advanced real-time stream processing and analytics
class RealTimeStreamProcessor extends AdvancedDataPipelineManager {
  constructor(mongoUri, streamConfig) {
    super(mongoUri, streamConfig);

    this.streamConfig = {
      ...streamConfig,
      enableWindowedProcessing: true,
      enableEventTimeProcessing: true,
      enableComplexEventProcessing: true,
      enableStreamAggregation: true
    };

    this.windowManager = new Map();
    this.eventPatterns = new Map();
    this.streamState = new Map();

    this.setupStreamProcessing();
  }

  async processEventStream(streamDefinition) {
    console.log('Setting up advanced event stream processing...');

    try {
      const streamProcessor = {
        streamId: this.generateStreamId(streamDefinition.name),
        definition: streamDefinition,

        // Windowing configuration
        windowConfig: {
          type: streamDefinition.windowType || 'tumbling', // tumbling, hopping, sliding
          size: streamDefinition.windowSize || 60000, // 1 minute
          advance: streamDefinition.windowAdvance || 30000 // 30 seconds
        },

        // Processing configuration
        processingConfig: {
          enableLateEvents: streamDefinition.enableLateEvents || false, // accept late-arriving events within the watermark delay
          watermarkDelay: streamDefinition.watermarkDelay || 5000,
          enableExactlyOnceProcessing: streamDefinition.enableExactlyOnceProcessing || false
        },

        // Analytics configuration
        analyticsConfig: {
          enableAggregation: streamDefinition.enableAggregation !== false,
          enablePatternDetection: streamDefinition.enablePatternDetection || false,
          enableAnomalyDetection: streamDefinition.enableAnomalyDetection || false,
          enableTrendAnalysis: streamDefinition.enableTrendAnalysis || false
        }
      };

      return await this.deployStreamProcessor(streamProcessor);

    } catch (error) {
      console.error('Error processing event stream:', error);
      return {
        success: false,
        error: error.message
      };
    }
  }

  async deployStreamProcessor(streamProcessor) {
    console.log(`Deploying stream processor: ${streamProcessor.streamId}`);

    try {
      // Setup windowed processing
      if (this.streamConfig.enableWindowedProcessing) {
        await this.setupWindowedProcessing(streamProcessor);
      }

      // Setup complex event processing
      if (this.streamConfig.enableComplexEventProcessing) {
        await this.setupComplexEventProcessing(streamProcessor);
      }

      // Setup stream aggregation
      if (this.streamConfig.enableStreamAggregation) {
        await this.setupStreamAggregation(streamProcessor);
      }

      return {
        success: true,
        streamId: streamProcessor.streamId,
        processorConfig: streamProcessor
      };

    } catch (error) {
      console.error(`Error deploying stream processor ${streamProcessor.streamId}:`, error);
      return {
        success: false,
        streamId: streamProcessor.streamId,
        error: error.message
      };
    }
  }
}
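
A brief, hedged example of driving this processor is shown below. The URI and stream definition values are assumptions, and the windowing, CEP, and aggregation setup helpers referenced in the class are assumed to be implemented elsewhere:

// Illustrative only: values below are assumptions about the expected shape.
async function deployOrderEventStream() {
  const processor = new RealTimeStreamProcessor('mongodb://localhost:27017/analytics', {
    enableChangeStreams: true,
    enableMetrics: true
  });

  const result = await processor.processEventStream({
    name: 'order_events',
    windowType: 'tumbling',   // tumbling, hopping, or sliding
    windowSize: 60000,        // 1-minute windows
    windowAdvance: 60000,
    watermarkDelay: 5000,     // tolerate 5 seconds of late-arriving events
    enableAggregation: true,
    enableAnomalyDetection: false
  });

  if (result.success) {
    console.log(`Stream processor deployed: ${result.streamId}`);
  } else {
    console.error(`Stream deployment failed: ${result.error}`);
  }
}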

SQL-Style Data Pipeline Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB data pipeline management:

-- QueryLeaf advanced data pipeline operations with SQL-familiar syntax for MongoDB

-- Pipeline definition and configuration
CREATE OR REPLACE PIPELINE customer_data_enrichment_pipeline
AS
WITH pipeline_config AS (
    -- Pipeline metadata and configuration
    SELECT 
        'customer_data_enrichment' as pipeline_name,
        'stream' as processing_mode,
        'change_stream' as trigger_type,
        true as auto_start,

        -- Performance configuration
        1000 as batch_size,
        5 as max_concurrency,
        300000 as timeout_ms,

        -- Quality configuration
        true as enable_data_validation,
        true as enable_schema_evolution,
        true as enable_data_lineage,

        -- Error handling
        'exponential_backoff' as retry_strategy,
        3 as max_retries,
        true as dead_letter_queue
),

data_sources AS (
    -- Define data sources for pipeline
    SELECT ARRAY[
        JSON_BUILD_OBJECT(
            'name', 'customer_changes',
            'collection', 'customers',
            'database', 'ecommerce',
            'filter', JSON_BUILD_OBJECT(
                'operationType', JSON_BUILD_OBJECT('$in', ARRAY['insert', 'update'])
            ),
            'change_stream_options', JSON_BUILD_OBJECT(
                'fullDocument', 'updateLookup',
                'fullDocumentBeforeChange', 'whenAvailable'
            )
        ),
        JSON_BUILD_OBJECT(
            'name', 'order_changes',
            'collection', 'orders',
            'database', 'ecommerce',
            'filter', JSON_BUILD_OBJECT(
                'fullDocument.customer_id', JSON_BUILD_OBJECT('$exists', true)
            )
        )
    ] as sources
),

transformation_stages AS (
    -- Define transformation pipeline stages
    SELECT ARRAY[
        -- Stage 1: Data enrichment with external lookups
        JSON_BUILD_OBJECT(
            'type', 'data_enrichment',
            'name', 'customer_profile_enrichment',
            'enrichment_config', JSON_BUILD_OBJECT(
                'enrichments', ARRAY[
                    JSON_BUILD_OBJECT(
                        'type', 'lookup',
                        'lookup_collection', 'customer_profiles',
                        'lookup_field', 'customer_id',
                        'source_field', 'fullDocument.customer_id',
                        'target_field', 'customer_profile'
                    ),
                    JSON_BUILD_OBJECT(
                        'type', 'calculation',
                        'calculations', ARRAY[
                            JSON_BUILD_OBJECT(
                                'field', 'customer_lifetime_value',
                                'expression', 'customer_profile.total_orders * customer_profile.avg_order_value'
                            ),
                            JSON_BUILD_OBJECT(
                                'field', 'customer_segment',
                                'expression', 'CASE WHEN customer_lifetime_value > 1000 THEN "premium" WHEN customer_lifetime_value > 500 THEN "standard" ELSE "basic" END'
                            )
                        ]
                    )
                ]
            )
        ),

        -- Stage 2: Field mapping and normalization
        JSON_BUILD_OBJECT(
            'type', 'field_mapping',
            'name', 'customer_data_mapping',
            'field_mappings', JSON_BUILD_OBJECT(
                'customer_id', 'fullDocument.customer_id',
                'customer_email', 'fullDocument.email',
                'customer_name', 'fullDocument.full_name',
                'customer_phone', 'fullDocument.phone_number',
                'registration_date', 'fullDocument.created_at',
                'last_login', 'fullDocument.last_login_at',
                'profile_completion', 'customer_profile.completion_percentage',
                'lifetime_value', 'customer_lifetime_value',
                'segment', 'customer_segment',
                'change_type', 'operationType',
                'change_timestamp', 'clusterTime'
            )
        ),

        -- Stage 3: Data validation and quality checks
        JSON_BUILD_OBJECT(
            'type', 'data_validation',
            'name', 'customer_data_validation',
            'validation_rules', ARRAY[
                JSON_BUILD_OBJECT(
                    'field', 'customer_email',
                    'type', 'email',
                    'required', true
                ),
                JSON_BUILD_OBJECT(
                    'field', 'customer_phone',
                    'type', 'phone',
                    'required', false
                ),
                JSON_BUILD_OBJECT(
                    'field', 'lifetime_value',
                    'type', 'numeric',
                    'min_value', 0,
                    'max_value', 100000
                )
            ]
        ),

        -- Stage 4: Aggregation for analytics
        JSON_BUILD_OBJECT(
            'type', 'aggregation',
            'name', 'customer_analytics_aggregation',
            'aggregation_pipeline', ARRAY[
                JSON_BUILD_OBJECT(
                    '$addFields', JSON_BUILD_OBJECT(
                        'processing_date', '$$NOW',
                        'data_freshness_score', JSON_BUILD_OBJECT(
                            '$subtract', ARRAY[100, JSON_BUILD_OBJECT(
                                '$divide', ARRAY[
                                    JSON_BUILD_OBJECT('$subtract', ARRAY['$$NOW', '$change_timestamp']),
                                    3600000  -- Convert to hours
                                ]
                            )]
                        ),
                        'engagement_score', JSON_BUILD_OBJECT(
                            '$multiply', ARRAY[
                                '$profile_completion',
                                JSON_BUILD_OBJECT('$cond', ARRAY[
                                    JSON_BUILD_OBJECT('$ne', ARRAY['$last_login', NULL]),
                                    1.2,  -- Boost for active users
                                    1.0
                                ])
                            ]
                        )
                    )
                ),
                JSON_BUILD_OBJECT(
                    '$addFields', JSON_BUILD_OBJECT(
                        'customer_score', JSON_BUILD_OBJECT(
                            '$add', ARRAY[
                                JSON_BUILD_OBJECT('$multiply', ARRAY['$lifetime_value', 0.4]),
                                JSON_BUILD_OBJECT('$multiply', ARRAY['$engagement_score', 0.3]),
                                JSON_BUILD_OBJECT('$multiply', ARRAY['$data_freshness_score', 0.3])
                            ]
                        )
                    )
                )
            ]
        )
    ] as stages
),

data_targets AS (
    -- Define output destinations
    SELECT ARRAY[
        JSON_BUILD_OBJECT(
            'name', 'enriched_customers',
            'type', 'mongodb_collection',
            'collection', 'enriched_customers',
            'database', 'analytics',
            'write_mode', 'upsert',
            'upsert_filter', JSON_BUILD_OBJECT('customer_id', '$customer_id')
        ),
        JSON_BUILD_OBJECT(
            'name', 'customer_analytics_stream',
            'type', 'message_queue',
            'queue_name', 'customer_analytics',
            'format', 'json',
            'partition_key', 'customer_segment'
        ),
        JSON_BUILD_OBJECT(
            'name', 'data_warehouse_export',
            'type', 'file',
            'file_path', '/data/exports/customer_enrichment',
            'format', 'parquet',
            'partition_by', ARRAY['segment', 'processing_date']
        )
    ] as targets
),

data_quality_rules AS (
    -- Define comprehensive data quality rules
    SELECT ARRAY[
        JSON_BUILD_OBJECT(
            'name', 'required_customer_id',
            'type', 'required',
            'field', 'customer_id',
            'severity', 'critical'
        ),
        JSON_BUILD_OBJECT(
            'name', 'valid_email_format',
            'type', 'regex',
            'field', 'customer_email',
            'pattern', '^[\\w\\.-]+@[\\w\\.-]+\\.[a-zA-Z]{2,}$',
            'severity', 'high'
        ),
        JSON_BUILD_OBJECT(
            'name', 'reasonable_lifetime_value',
            'type', 'range',
            'field', 'lifetime_value',
            'min', 0,
            'max', 50000,
            'severity', 'medium'
        ),
        JSON_BUILD_OBJECT(
            'name', 'valid_customer_segment',
            'type', 'enum',
            'field', 'segment',
            'allowed_values', ARRAY['premium', 'standard', 'basic'],
            'severity', 'high'
        )
    ] as rules
)

-- Create the pipeline with comprehensive configuration
SELECT 
    'customer_data_enrichment_pipeline' as pipeline_name,
    pipeline_config.*,
    data_sources.sources,
    transformation_stages.stages,
    data_targets.targets,
    data_quality_rules.rules,

    -- Pipeline scheduling
    JSON_BUILD_OBJECT(
        'schedule_type', 'real_time',
        'trigger_conditions', ARRAY[
            'customer_data_change',
            'order_completion',
            'profile_update'
        ]
    ) as scheduling_config,

    -- Monitoring configuration  
    JSON_BUILD_OBJECT(
        'enable_metrics', true,
        'metrics_interval_seconds', 60,
        'alert_thresholds', JSON_BUILD_OBJECT(
            'error_rate_percent', 5,
            'processing_latency_ms', 5000,
            'throughput_records_per_second', 100
        ),
        'notification_channels', ARRAY[
            'email:data-team@company.com',
            'slack:#data-pipelines',
            'webhook:https://monitoring.company.com/alerts'
        ]
    ) as monitoring_config

FROM pipeline_config, data_sources, transformation_stages, data_targets, data_quality_rules;

-- Pipeline execution and monitoring queries

-- Real-time pipeline performance monitoring
WITH pipeline_performance AS (
    SELECT 
        pipeline_id,
        pipeline_name,
        run_id,
        start_time,
        end_time,
        status,

        -- Processing metrics
        records_processed,
        records_successful,
        records_failed,

        -- Performance calculations
        EXTRACT(EPOCH FROM (COALESCE(end_time, CURRENT_TIMESTAMP) - start_time)) * 1000 as duration_ms,

        -- Throughput calculation
        CASE 
            WHEN EXTRACT(EPOCH FROM (COALESCE(end_time, CURRENT_TIMESTAMP) - start_time)) > 0 THEN
                records_processed / EXTRACT(EPOCH FROM (COALESCE(end_time, CURRENT_TIMESTAMP) - start_time))
            ELSE 0
        END as throughput_records_per_second,

        -- Success rate
        CASE 
            WHEN records_processed > 0 THEN 
                (records_successful * 100.0) / records_processed
            ELSE 0
        END as success_rate_percent,

        -- Resource utilization
        memory_usage_mb,
        cpu_usage_percent,

        -- Current processing lag
        CASE 
            WHEN status = 'running' THEN 
                EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - last_processed_timestamp))
            ELSE NULL
        END as current_lag_seconds

    FROM pipeline_runs
    WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
),

performance_summary AS (
    SELECT 
        pipeline_name,
        COUNT(*) as total_runs,
        COUNT(*) FILTER (WHERE status = 'completed') as successful_runs,
        COUNT(*) FILTER (WHERE status = 'failed') as failed_runs,
        COUNT(*) FILTER (WHERE status = 'running') as active_runs,

        -- Aggregate performance metrics
        SUM(records_processed) as total_records_processed,
        SUM(records_successful) as total_records_successful,
        SUM(records_failed) as total_records_failed,

        -- Performance statistics
        AVG(duration_ms) as avg_duration_ms,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_duration_ms,
        AVG(throughput_records_per_second) as avg_throughput_rps,
        MAX(throughput_records_per_second) as max_throughput_rps,

        -- Quality metrics
        AVG(success_rate_percent) as avg_success_rate,
        MIN(success_rate_percent) as min_success_rate,

        -- Resource usage
        AVG(memory_usage_mb) as avg_memory_usage_mb,
        MAX(memory_usage_mb) as max_memory_usage_mb,
        AVG(cpu_usage_percent) as avg_cpu_usage,
        MAX(cpu_usage_percent) as max_cpu_usage,

        -- Lag analysis
        AVG(current_lag_seconds) as avg_processing_lag_seconds,
        MAX(current_lag_seconds) as max_processing_lag_seconds

    FROM pipeline_performance
    GROUP BY pipeline_name
)

SELECT 
    pipeline_name,
    total_runs,
    successful_runs,
    failed_runs,
    active_runs,

    -- Overall health assessment
    CASE 
        WHEN failed_runs > total_runs * 0.1 THEN 'critical'
        WHEN avg_success_rate < 95 THEN 'warning'
        WHEN avg_processing_lag_seconds > 300 THEN 'warning'  -- 5 minutes lag
        WHEN max_cpu_usage > 90 OR max_memory_usage_mb > 4096 THEN 'warning'
        ELSE 'healthy'
    END as health_status,

    -- Processing statistics
    total_records_processed,
    total_records_successful,
    total_records_failed,

    -- Performance metrics
    ROUND(avg_duration_ms, 0) as avg_duration_ms,
    ROUND(p95_duration_ms, 0) as p95_duration_ms,
    ROUND(avg_throughput_rps, 2) as avg_throughput_rps,
    ROUND(max_throughput_rps, 2) as max_throughput_rps,

    -- Quality and reliability
    ROUND(avg_success_rate, 2) as avg_success_rate_percent,
    ROUND(min_success_rate, 2) as min_success_rate_percent,

    -- Resource utilization
    ROUND(avg_memory_usage_mb, 0) as avg_memory_usage_mb,
    ROUND(max_memory_usage_mb, 0) as max_memory_usage_mb,
    ROUND(avg_cpu_usage, 1) as avg_cpu_usage_percent,
    ROUND(max_cpu_usage, 1) as max_cpu_usage_percent,

    -- Processing lag indicators
    COALESCE(ROUND(avg_processing_lag_seconds, 0), 0) as avg_lag_seconds,
    COALESCE(ROUND(max_processing_lag_seconds, 0), 0) as max_lag_seconds,

    -- Operational recommendations
    CASE 
        WHEN failed_runs > total_runs * 0.05 THEN 'investigate_errors'
        WHEN avg_throughput_rps < 50 THEN 'optimize_performance'
        WHEN max_cpu_usage > 80 THEN 'scale_up_resources'
        WHEN avg_processing_lag_seconds > 120 THEN 'reduce_processing_latency'
        ELSE 'monitor_continued'
    END as recommendation,

    -- Capacity planning
    CASE 
        WHEN max_throughput_rps / avg_throughput_rps < 1.5 THEN 'add_capacity'
        WHEN max_memory_usage_mb > 3072 THEN 'increase_memory'
        WHEN active_runs > 1 THEN 'check_concurrency_limits'
        ELSE 'capacity_sufficient'
    END as capacity_recommendation

FROM performance_summary
ORDER BY 
    CASE health_status 
        WHEN 'critical' THEN 1 
        WHEN 'warning' THEN 2 
        ELSE 3 
    END,
    total_records_processed DESC;

-- Data lineage and quality tracking
WITH data_lineage_analysis AS (
    SELECT 
        pipeline_id,
        DATE_TRUNC('hour', timestamp) as processing_hour,

        -- Source and target tracking
        JSONB_ARRAY_ELEMENTS(data_sources) ->> 'collection' as source_collection,
        JSONB_ARRAY_ELEMENTS(data_targets) ->> 'collection' as target_collection,

        -- Data quality metrics
        COUNT(*) as total_transformations,
        COUNT(*) FILTER (WHERE original_data_checksum != processed_data_checksum) as data_modified,
        COUNT(DISTINCT original_record_id) as unique_source_records,
        COUNT(DISTINCT processed_record_id) as unique_target_records,

        -- Transformation tracking
        JSONB_ARRAY_ELEMENTS(transformations) ->> 'type' as transformation_type,
        COUNT(*) FILTER (WHERE (JSONB_ARRAY_ELEMENTS(transformations) ->> 'applied')::boolean = true) as transformations_applied,

        -- Data integrity checks
        COUNT(*) FILTER (WHERE original_data_checksum IS NOT NULL AND processed_data_checksum IS NOT NULL) as checksum_validations,

        -- Processing metadata
        MIN(timestamp) as first_processing,
        MAX(timestamp) as last_processing

    FROM data_lineage
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY 
        pipeline_id, 
        DATE_TRUNC('hour', timestamp),
        JSONB_ARRAY_ELEMENTS(data_sources) ->> 'collection',
        JSONB_ARRAY_ELEMENTS(data_targets) ->> 'collection',
        JSONB_ARRAY_ELEMENTS(transformations) ->> 'type'
),

quality_summary AS (
    SELECT 
        pipeline_id,
        source_collection,
        target_collection,
        transformation_type,

        -- Aggregated metrics
        SUM(total_transformations) as total_transformations,
        SUM(data_modified) as total_data_modified,
        SUM(unique_source_records) as total_source_records,
        SUM(unique_target_records) as total_target_records,
        SUM(transformations_applied) as total_transformations_applied,
        SUM(checksum_validations) as total_checksum_validations,

        -- Data quality calculations
        CASE 
            WHEN SUM(total_transformations) > 0 THEN
                (SUM(transformations_applied) * 100.0) / SUM(total_transformations)
            ELSE 0
        END as transformation_success_rate,

        CASE 
            WHEN SUM(unique_source_records) > 0 THEN
                (SUM(unique_target_records) * 100.0) / SUM(unique_source_records)
            ELSE 0
        END as record_completeness_rate,

        -- Data modification analysis
        CASE 
            WHEN SUM(total_transformations) > 0 THEN
                (SUM(data_modified) * 100.0) / SUM(total_transformations)
            ELSE 0
        END as data_modification_rate,

        -- Processing consistency
        COUNT(DISTINCT processing_hour) as processing_hours_active,
        AVG(EXTRACT(EPOCH FROM (last_processing - first_processing)) / 60) as avg_processing_window_minutes

    FROM data_lineage_analysis
    GROUP BY pipeline_id, source_collection, target_collection, transformation_type
)

SELECT 
    pipeline_id,
    source_collection,
    target_collection,
    transformation_type,

    -- Volume metrics
    total_source_records,
    total_target_records,
    total_transformations,
    total_transformations_applied,

    -- Quality scores
    ROUND(transformation_success_rate, 2) as transformation_success_percent,
    ROUND(record_completeness_rate, 2) as record_completeness_percent,
    ROUND(data_modification_rate, 2) as data_modification_percent,

    -- Data integrity assessment
    total_checksum_validations,
    CASE 
        WHEN total_checksum_validations > 0 AND transformation_success_rate > 98 THEN 'excellent'
        WHEN total_checksum_validations > 0 AND transformation_success_rate > 95 THEN 'good'
        WHEN total_checksum_validations > 0 AND transformation_success_rate > 90 THEN 'acceptable'
        ELSE 'needs_attention'
    END as data_quality_rating,

    -- Processing consistency
    processing_hours_active,
    ROUND(avg_processing_window_minutes, 1) as avg_processing_window_minutes,

    -- Operational insights
    CASE 
        WHEN record_completeness_rate < 98 THEN 'investigate_data_loss'
        WHEN transformation_success_rate < 95 THEN 'review_transformation_logic'
        WHEN data_modification_rate > 80 THEN 'validate_transformation_accuracy'
        WHEN avg_processing_window_minutes > 60 THEN 'optimize_processing_speed'
        ELSE 'quality_acceptable'
    END as quality_recommendation,

    -- Data flow health
    CASE 
        WHEN record_completeness_rate > 99 AND transformation_success_rate > 98 THEN 'healthy'
        WHEN record_completeness_rate > 95 AND transformation_success_rate > 95 THEN 'stable'
        WHEN record_completeness_rate > 90 AND transformation_success_rate > 90 THEN 'concerning'
        ELSE 'critical'
    END as data_flow_health

FROM quality_summary
WHERE total_transformations > 0
ORDER BY 
    CASE data_flow_health 
        WHEN 'critical' THEN 1 
        WHEN 'concerning' THEN 2 
        WHEN 'stable' THEN 3 
        ELSE 4 
    END,
    total_source_records DESC;

-- Error analysis and troubleshooting
SELECT 
    pe.pipeline_id,
    pe.run_id,
    pe.error_time,
    pe.error_type,
    pe.error_message,

    -- Error context
    pe.processing_context ->> 'recordsProcessedBeforeError' as records_before_error,
    pe.processing_context ->> 'runDuration' as run_duration_before_error,

    -- Error frequency analysis
    COUNT(*) OVER (
        PARTITION BY pe.pipeline_id, pe.error_type 
        ORDER BY pe.error_time 
        RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
    ) as similar_errors_last_hour,

    -- Error pattern detection
    LAG(pe.error_time) OVER (
        PARTITION BY pe.pipeline_id, pe.error_type 
        ORDER BY pe.error_time
    ) as previous_similar_error,

    -- Pipeline run context
    pr.start_time as run_start_time,
    pr.records_processed as total_run_records,
    pr.status as run_status,

    -- Resolution tracking
    CASE 
        WHEN pe.error_type IN ('ValidationError', 'SchemaError') THEN 'data_quality_issue'
        WHEN pe.error_type IN ('ConnectionError', 'TimeoutError') THEN 'infrastructure_issue'
        WHEN pe.error_type IN ('TransformationError', 'ProcessingError') THEN 'logic_issue'
        ELSE 'unknown_category'
    END as error_category,

    -- Priority assessment
    CASE 
        WHEN COUNT(*) OVER (PARTITION BY pe.pipeline_id, pe.error_type ORDER BY pe.error_time RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW) > 10 THEN 'high'
        WHEN pe.error_type IN ('ConnectionError', 'TimeoutError') THEN 'high'
        WHEN pr.records_processed > 1000 THEN 'medium'
        ELSE 'low'
    END as error_priority,

    -- Suggested resolution
    CASE 
        WHEN pe.error_type = 'ValidationError' THEN 'Review data quality rules and source data format'
        WHEN pe.error_type = 'ConnectionError' THEN 'Check database connectivity and network stability'
        WHEN pe.error_type = 'TimeoutError' THEN 'Increase timeout values or optimize query performance'
        WHEN pe.error_type = 'TransformationError' THEN 'Review transformation logic and test with sample data'
        ELSE 'Investigate error stack trace and contact development team'
    END as suggested_resolution

FROM pipeline_errors pe
JOIN pipeline_runs pr ON pe.run_id = pr.run_id
WHERE pe.error_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
ORDER BY 
    CASE error_priority 
        WHEN 'high' THEN 1 
        WHEN 'medium' THEN 2 
        ELSE 3 
    END,
    pe.error_time DESC;

-- QueryLeaf provides comprehensive MongoDB data pipeline capabilities:
-- 1. Real-time change stream processing with SQL-familiar syntax
-- 2. Advanced data transformation and enrichment operations
-- 3. Comprehensive data quality validation and monitoring
-- 4. Pipeline orchestration and scheduling capabilities
-- 5. Data lineage tracking and audit functionality
-- 6. Error handling and troubleshooting tools
-- 7. Performance monitoring and optimization features
-- 8. Stream processing and windowed analytics
-- 9. SQL-style pipeline definition and management
-- 10. Enterprise-grade reliability and fault tolerance

Best Practices for Production Data Pipelines

Pipeline Architecture and Design Principles

Essential principles for effective MongoDB data pipeline deployment:

  1. Stream Processing Design: Implement real-time change stream processing for low-latency data operations
  2. Data Quality Management: Establish comprehensive validation rules and monitoring for data integrity
  3. Error Handling Strategy: Design robust error handling with retry mechanisms and dead letter queues (see the retry sketch after this list)
  4. Performance Optimization: Optimize pipeline throughput with appropriate batching and concurrency settings
  5. Monitoring Integration: Implement comprehensive monitoring for pipeline health and data quality metrics
  6. Schema Evolution: Plan for schema changes and backward compatibility in data transformations
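
Point 3 above can be sketched directly against the Node.js driver. The example below is a minimal illustration rather than part of any library API: processEvent is a user-supplied handler and the pipeline_dead_letters collection name is an assumption:

// Minimal retry-with-dead-letter sketch for pipeline event processing
async function processWithRetry(db, event, processEvent, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await processEvent(event);
      return { success: true, attempts: attempt };
    } catch (error) {
      if (attempt === maxRetries) {
        // Retries exhausted: park the event in a dead letter collection for inspection
        await db.collection('pipeline_dead_letters').insertOne({
          event: event,
          error: error.message,
          attempts: attempt,
          failedAt: new Date()
        });
        return { success: false, attempts: attempt, deadLettered: true };
      }

      // Exponential backoff with a small jitter before the next attempt
      const delayMs = Math.min(30000, 1000 * 2 ** (attempt - 1)) + Math.random() * 250;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}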

Scalability and Production Operations

Optimize data pipeline operations for enterprise-scale requirements:

  1. Resource Management: Configure appropriate resource limits and scaling policies for pipeline execution
  2. Data Lineage: Track data transformations and dependencies for auditing and troubleshooting
  3. Backup and Recovery: Implement pipeline state backup and recovery mechanisms, for example by persisting change stream resume tokens (see the sketch after this list)
  4. Security Integration: Ensure pipeline operations meet security and compliance requirements
  5. Operational Integration: Integrate pipeline monitoring with existing alerting and operational workflows
  6. Cost Optimization: Monitor resource usage and optimize pipeline efficiency for cost-effective operations
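
For point 3, change stream resume tokens provide a simple pipeline checkpoint mechanism. The sketch below assumes illustrative names (an 'orders' source collection, a 'pipeline_checkpoints' collection, and a caller-supplied handleChange function):

// Minimal pipeline-state recovery sketch using change stream resume tokens
async function runResumablePipeline(db, pipelineId, handleChange) {
  const checkpoints = db.collection('pipeline_checkpoints');
  const saved = await checkpoints.findOne({ _id: pipelineId });

  // Resume from the last persisted token if one exists, otherwise start fresh
  const watchOptions = saved && saved.resumeToken ? { resumeAfter: saved.resumeToken } : {};
  const changeStream = db.collection('orders').watch([], watchOptions);

  for await (const change of changeStream) {
    await handleChange(change);

    // Persist the resume token only after the event is fully processed so a
    // crash replays events at-least-once instead of silently dropping them
    await checkpoints.updateOne(
      { _id: pipelineId },
      { $set: { resumeToken: change._id, updatedAt: new Date() } },
      { upsert: true }
    );
  }
}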

Conclusion

MongoDB data pipeline management provides sophisticated real-time data processing capabilities that enable modern applications to handle complex data transformation workflows, stream processing, and ETL operations with advanced monitoring, error handling, and scalability features. The native change stream support and aggregation framework ensure that data pipelines can process high-volume data streams efficiently while maintaining data quality and reliability.

Key MongoDB Data Pipeline benefits include:

  • Real-Time Processing: Native change stream support for immediate data processing and transformation
  • Advanced Transformations: Comprehensive data transformation capabilities with aggregation framework integration
  • Data Quality Management: Built-in validation, monitoring, and quality assessment tools
  • Stream Processing: Sophisticated stream processing patterns for complex event processing and analytics
  • Pipeline Orchestration: Flexible pipeline scheduling and orchestration with error handling and recovery
  • SQL Accessibility: Familiar SQL-style pipeline operations through QueryLeaf for accessible data pipeline management

Whether you're building real-time analytics systems, data warehousing pipelines, microservices data synchronization, or complex ETL workflows, MongoDB data pipeline management with QueryLeaf's familiar SQL interface provides the foundation for sophisticated, scalable data processing operations.

QueryLeaf Integration: QueryLeaf automatically translates SQL-style pipeline operations into MongoDB's native change streams and aggregation pipelines, making advanced data processing functionality accessible to SQL-oriented development teams. Complex data transformations, stream processing operations, and pipeline orchestration are seamlessly handled through familiar SQL constructs, enabling sophisticated data workflows without requiring deep MongoDB pipeline expertise.

The combination of MongoDB's robust data pipeline capabilities with SQL-style pipeline management operations makes it an ideal platform for applications requiring both sophisticated real-time data processing and familiar database management patterns, ensuring your data pipelines can scale efficiently while maintaining reliability and performance as data volume and processing complexity grow.

MongoDB Time Series Data Storage and Optimization: Advanced Temporal Data Analytics and High-Performance Storage Strategies

Modern applications generate massive volumes of time-stamped data from IoT devices, system monitoring, financial markets, user analytics, and sensor networks. Managing temporal data efficiently requires specialized storage strategies that can handle high ingestion rates, optimize storage utilization, and provide fast analytical queries across time ranges. Traditional relational databases struggle with time series workloads due to inefficient storage patterns, limited compression capabilities, and poor query performance for temporal analytics.

MongoDB's time series collections provide purpose-built capabilities for temporal data management through advanced compression algorithms, optimized storage layouts, and specialized indexing strategies. Unlike traditional approaches that require complex partitioning schemes and manual optimization, MongoDB time series collections automatically optimize storage efficiency, query performance, and analytical capabilities while maintaining schema flexibility for diverse time-stamped data formats.

The Traditional Time Series Data Challenge

Conventional approaches to time series data management in relational databases face significant limitations:

-- Traditional PostgreSQL time series data handling - inefficient storage and limited optimization

-- Basic time series table with poor storage efficiency
CREATE TABLE sensor_readings (
    reading_id SERIAL,
    device_id VARCHAR(50) NOT NULL,
    sensor_type VARCHAR(50) NOT NULL,
    location VARCHAR(100),
    timestamp TIMESTAMP NOT NULL,

    -- Measurements stored as separate columns (inflexible schema)
    temperature DECIMAL(5,2),
    humidity DECIMAL(5,2),
    pressure DECIMAL(7,2),
    battery_level INTEGER,
    signal_strength INTEGER,

    -- Limited metadata support
    device_metadata JSONB,

    -- Basic audit fields
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Partition key must be included in the primary key
    PRIMARY KEY (reading_id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Manual index creation (indexes must be maintained across every partition)
CREATE INDEX idx_sensor_readings_timestamp ON sensor_readings(timestamp DESC);
CREATE INDEX idx_sensor_readings_device_time ON sensor_readings(device_id, timestamp DESC);
CREATE INDEX idx_sensor_readings_type_time ON sensor_readings(sensor_type, timestamp DESC);

-- Attempt at time-based partitioning (limited automation)
DO $$
DECLARE
    start_date DATE;
    end_date DATE;
    partition_name TEXT;
BEGIN
    start_date := DATE_TRUNC('month', CURRENT_DATE - INTERVAL '6 months');

    WHILE start_date <= DATE_TRUNC('month', CURRENT_DATE + INTERVAL '3 months') LOOP
        end_date := start_date + INTERVAL '1 month';
        partition_name := 'sensor_readings_' || TO_CHAR(start_date, 'YYYY_MM');

        EXECUTE format('
            CREATE TABLE IF NOT EXISTS %I PARTITION OF sensor_readings
            FOR VALUES FROM (%L) TO (%L)',
            partition_name, start_date, end_date);

        start_date := end_date;
    END LOOP;
END;
$$;

-- Time series aggregation queries (inefficient for large datasets)
WITH hourly_averages AS (
    SELECT 
        device_id,
        sensor_type,
        DATE_TRUNC('hour', timestamp) as hour_bucket,

        -- Basic aggregations (limited analytical functions)
        COUNT(*) as reading_count,
        AVG(temperature) as avg_temperature,
        AVG(humidity) as avg_humidity,
        AVG(pressure) as avg_pressure,
        MIN(temperature) as min_temperature,
        MAX(temperature) as max_temperature,

        -- Standard deviation calculations (expensive)
        STDDEV(temperature) as temp_stddev,
        STDDEV(humidity) as humidity_stddev,

        -- Battery and connectivity metrics
        AVG(battery_level) as avg_battery,
        AVG(signal_strength) as avg_signal_strength,

        -- Data quality metrics
        COUNT(*) FILTER (WHERE temperature IS NOT NULL) as valid_temp_readings,
        COUNT(*) FILTER (WHERE humidity IS NOT NULL) as valid_humidity_readings

    FROM sensor_readings sr
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND timestamp < CURRENT_TIMESTAMP
    GROUP BY device_id, sensor_type, DATE_TRUNC('hour', timestamp)
),

daily_summaries AS (
    SELECT 
        device_id,
        sensor_type,
        DATE_TRUNC('day', hour_bucket) as day_bucket,

        -- Aggregation of aggregations (double computation overhead)
        SUM(reading_count) as total_readings_per_day,
        AVG(avg_temperature) as daily_avg_temperature,
        MIN(min_temperature) as daily_min_temperature,
        MAX(max_temperature) as daily_max_temperature,
        AVG(avg_humidity) as daily_avg_humidity,
        AVG(avg_pressure) as daily_avg_pressure,

        -- Battery consumption analysis
        MIN(avg_battery) as daily_min_battery,
        AVG(avg_battery) as daily_avg_battery,

        -- Connectivity quality
        AVG(avg_signal_strength) as daily_avg_signal,

        -- Data completeness metrics
        ROUND(
            (SUM(valid_temp_readings) * 100.0) / NULLIF(SUM(reading_count), 0), 2
        ) as temperature_data_completeness_percent,

        ROUND(
            (SUM(valid_humidity_readings) * 100.0) / NULLIF(SUM(reading_count), 0), 2
        ) as humidity_data_completeness_percent

    FROM hourly_averages
    GROUP BY device_id, sensor_type, DATE_TRUNC('day', hour_bucket)
),

device_health_analysis AS (
    -- Complex analysis requiring multiple scans
    SELECT 
        ds.device_id,
        ds.sensor_type,
        COUNT(*) as analysis_days,

        -- Temperature trend analysis (limited analytical capabilities)
        AVG(ds.daily_avg_temperature) as overall_avg_temperature,
        STDDEV(ds.daily_avg_temperature) as temperature_variability,

        -- Battery degradation analysis
        CASE 
            WHEN COUNT(*) > 1 THEN
                -- Simple linear trend approximation
                (MAX(ds.daily_avg_battery) - MIN(ds.daily_avg_battery)) / NULLIF(COUNT(*) - 1, 0)
            ELSE NULL
        END as daily_battery_degradation_rate,

        -- Connectivity stability
        AVG(ds.daily_avg_signal) as avg_connectivity,
        STDDEV(ds.daily_avg_signal) as connectivity_stability,

        -- Data quality assessment
        AVG(ds.temperature_data_completeness_percent) as avg_data_completeness,

        -- Device status classification
        CASE 
            WHEN AVG(ds.daily_avg_battery) < 20 THEN 'low_battery'
            WHEN AVG(ds.daily_avg_signal) < 30 THEN 'poor_connectivity'  
            WHEN AVG(ds.temperature_data_completeness_percent) < 80 THEN 'unreliable_data'
            ELSE 'healthy'
        END as device_status,

        -- Alert generation
        ARRAY[
            CASE WHEN AVG(ds.daily_avg_battery) < 15 THEN 'CRITICAL_BATTERY' END,
            CASE WHEN AVG(ds.daily_avg_signal) < 20 THEN 'CRITICAL_CONNECTIVITY' END,
            CASE WHEN AVG(ds.temperature_data_completeness_percent) < 50 THEN 'DATA_QUALITY_ISSUE' END,
            CASE WHEN STDDEV(ds.daily_avg_temperature) > 10 THEN 'TEMPERATURE_ANOMALY' END
        ]::TEXT[] as active_alerts

    FROM daily_summaries ds
    WHERE ds.day_bucket >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY ds.device_id, ds.sensor_type
)
SELECT 
    device_id,
    sensor_type,
    analysis_days,

    -- Performance metrics
    ROUND(overall_avg_temperature, 2) as avg_temp,
    ROUND(temperature_variability, 2) as temp_variability,
    ROUND(daily_battery_degradation_rate, 4) as battery_degradation_per_day,
    ROUND(avg_connectivity, 1) as avg_signal_strength,
    ROUND(avg_data_completeness, 1) as data_completeness_percent,

    -- Status and alerts
    device_status,
    ARRAY_REMOVE(active_alerts, NULL) as alerts,

    -- Recommendations
    CASE device_status
        WHEN 'low_battery' THEN 'Schedule battery replacement or reduce sampling frequency'
        WHEN 'poor_connectivity' THEN 'Check network coverage or relocate device'
        WHEN 'unreliable_data' THEN 'Inspect device sensors and calibration'
        ELSE 'Device operating normally'
    END as recommendation

FROM device_health_analysis
ORDER BY 
    CASE device_status 
        WHEN 'low_battery' THEN 1
        WHEN 'poor_connectivity' THEN 2  
        WHEN 'unreliable_data' THEN 3
        ELSE 4
    END,
    overall_avg_temperature DESC;

-- Traditional approach problems:
-- 1. Inefficient storage - no automatic compression for time series patterns
-- 2. Manual partitioning overhead with limited automation
-- 3. Poor query performance for time range analytics
-- 4. Complex aggregation logic requiring multiple query stages
-- 5. Limited schema flexibility for diverse sensor data
-- 6. No built-in time series analytical functions
-- 7. Expensive index maintenance for time-based queries
-- 8. Poor compression ratios leading to high storage costs
-- 9. Complex retention policy implementation
-- 10. Limited support for high-frequency data ingestion

-- Attempt at high-frequency data insertion (poor performance)
INSERT INTO sensor_readings (
    device_id, sensor_type, location, timestamp,
    temperature, humidity, pressure, battery_level, signal_strength
)
VALUES 
    ('device_001', 'environmental', 'warehouse_a', '2024-10-14 10:00:00', 23.5, 45.2, 1013.2, 85, 75),
    ('device_001', 'environmental', 'warehouse_a', '2024-10-14 10:00:10', 23.6, 45.1, 1013.3, 85, 76),
    ('device_001', 'environmental', 'warehouse_a', '2024-10-14 10:00:20', 23.4, 45.3, 1013.1, 85, 74),
    ('device_002', 'environmental', 'warehouse_b', '2024-10-14 10:00:00', 24.1, 42.8, 1012.8, 90, 82),
    ('device_002', 'environmental', 'warehouse_b', '2024-10-14 10:00:10', 24.2, 42.9, 1012.9, 90, 83);
-- Individual inserts are extremely inefficient for high-frequency data

-- Range queries with limited optimization
SELECT 
    device_id,
    AVG(temperature) as avg_temp,
    COUNT(*) as reading_count
FROM sensor_readings
WHERE timestamp BETWEEN '2024-10-14 09:00:00' AND '2024-10-14 11:00:00'
    AND sensor_type = 'environmental'
GROUP BY device_id
ORDER BY avg_temp DESC;

-- Problems:
-- 1. Full table scan for time range queries despite indexing
-- 2. No automatic data compression reducing storage efficiency
-- 3. Poor aggregation performance for time-based analytics
-- 4. Limited analytical functions for time series analysis
-- 5. Complex retention and archival policy implementation
-- 6. No built-in support for irregular time intervals
-- 7. Inefficient handling of sparse data and missing measurements
-- 8. Manual optimization required for high ingestion rates
-- 9. Limited support for multi-metric time series analysis
-- 10. Complex downsampling and data summarization requirements

MongoDB provides sophisticated time series collection capabilities with automatic optimization:

// MongoDB Advanced Time Series Data Management
const { MongoClient, ObjectId } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('advanced_time_series');

// Comprehensive MongoDB Time Series Manager
class AdvancedTimeSeriesManager {
  constructor(db, config = {}) {
    this.db = db;
    this.config = {
      // Time series collection configuration
      defaultGranularity: config.defaultGranularity || 'seconds',
      defaultExpiration: config.defaultExpiration || 86400 * 30, // 30 days
      enableCompression: config.enableCompression !== false,

      // Bucketing and storage optimization
      bucketMaxSpanSeconds: config.bucketMaxSpanSeconds || 3600, // 1 hour
      bucketRoundingSeconds: config.bucketRoundingSeconds || 60, // 1 minute

      // Performance optimization
      enablePreAggregation: config.enablePreAggregation || false,
      aggregationLevels: config.aggregationLevels || ['hourly', 'daily'],
      enableAutomaticIndexing: config.enableAutomaticIndexing !== false,

      // Data retention and lifecycle
      enableAutomaticExpiration: config.enableAutomaticExpiration !== false,
      retentionPolicies: config.retentionPolicies || {
        raw: 7 * 24 * 3600,      // 7 days
        hourly: 90 * 24 * 3600,  // 90 days  
        daily: 365 * 24 * 3600   // 1 year
      },

      // Quality and monitoring
      enableDataQualityTracking: config.enableDataQualityTracking || false,
      enableAnomalyDetection: config.enableAnomalyDetection || false,
      alertingThresholds: config.alertingThresholds || {}
    };

    this.collections = new Map();
    this.aggregationPipelines = new Map();

    this.initializeTimeSeriesSystem();
  }

  async initializeTimeSeriesSystem() {
    console.log('Initializing advanced time series system...');

    try {
      // Time series collections are created on demand via createTimeSeriesCollection(),
      // so no upfront collection provisioning is needed here

      // Configure automatic aggregation pipelines  
      if (this.config.enablePreAggregation) {
        await this.setupPreAggregationPipelines();
      }

      // Setup data quality monitoring
      if (this.config.enableDataQualityTracking) {
        await this.setupDataQualityMonitoring();
      }

      // Initialize retention policies
      if (this.config.enableAutomaticExpiration) {
        await this.setupRetentionPolicies();
      }

      console.log('Time series system initialized successfully');

    } catch (error) {
      console.error('Error initializing time series system:', error);
      throw error;
    }
  }

  async createTimeSeriesCollection(collectionName, options = {}) {
    console.log(`Creating optimized time series collection: ${collectionName}`);

    try {
      const timeSeriesOptions = {
        timeseries: {
          timeField: options.timeField || 'timestamp',
          metaField: options.metaField || 'metadata'
        },

        // Automatic expiration configuration
        expireAfterSeconds: options.expireAfterSeconds || this.config.defaultExpiration
      };

      // Advanced bucketing configuration: MongoDB treats granularity and the custom
      // bucketing parameters as mutually exclusive, and bucketMaxSpanSeconds /
      // bucketRoundingSeconds must be set to the same value when they are used
      if (options.bucketMaxSpanSeconds || options.bucketRoundingSeconds) {
        const bucketSeconds = options.bucketMaxSpanSeconds || options.bucketRoundingSeconds;
        timeSeriesOptions.timeseries.bucketMaxSpanSeconds = bucketSeconds;
        timeSeriesOptions.timeseries.bucketRoundingSeconds = bucketSeconds;
      } else {
        timeSeriesOptions.timeseries.granularity = options.granularity || this.config.defaultGranularity;
      }

      // Storage optimization: only pass a storage engine override when compression is requested
      if (options.enableCompression ?? this.config.enableCompression) {
        timeSeriesOptions.storageEngine = {
          wiredTiger: { configString: 'block_compressor=zstd' }
        };
      }

      // Create the time series collection
      const collection = await this.db.createCollection(collectionName, timeSeriesOptions);

      // Store collection reference for management
      this.collections.set(collectionName, {
        collection: collection,
        config: timeSeriesOptions,
        createdAt: new Date()
      });

      // Create optimized indexes for time series queries
      await this.createTimeSeriesIndexes(collection, options);

      console.log(`Time series collection '${collectionName}' created successfully`);

      return {
        success: true,
        collectionName: collectionName,
        configuration: timeSeriesOptions,
        indexesCreated: true
      };

    } catch (error) {
      console.error(`Error creating time series collection '${collectionName}':`, error);
      return {
        success: false,
        error: error.message,
        collectionName: collectionName
      };
    }
  }

  async createTimeSeriesIndexes(collection, options = {}) {
    console.log('Creating optimized indexes for time series collection...');

    try {
      const indexes = [
        // Compound index for time range queries with metadata
        {
          key: { 
            [`${options.metaField || 'metadata'}.device_id`]: 1,
            [`${options.timeField || 'timestamp'}`]: -1 
          },
          name: 'device_time_idx',
          background: true
        },

        // Index for sensor type queries
        {
          key: { 
            [`${options.metaField || 'metadata'}.sensor_type`]: 1,
            [`${options.timeField || 'timestamp'}`]: -1 
          },
          name: 'sensor_time_idx',
          background: true
        },

        // Compound index for location-based queries
        {
          key: { 
            [`${options.metaField || 'metadata'}.location`]: 1,
            [`${options.metaField || 'metadata'}.device_id`]: 1,
            [`${options.timeField || 'timestamp'}`]: -1 
          },
          name: 'location_device_time_idx',
          background: true
        },

        // Index for data quality queries
        {
          key: { 
            [`${options.metaField || 'metadata'}.data_quality`]: 1,
            [`${options.timeField || 'timestamp'}`]: -1 
          },
          name: 'quality_time_idx',
          background: true,
          sparse: true
        }
      ];

      // Create all indexes
      await collection.createIndexes(indexes);

      console.log(`Created ${indexes.length} optimized indexes for time series collection`);

    } catch (error) {
      console.error('Error creating time series indexes:', error);
      throw error;
    }
  }

  async insertTimeSeriesData(collectionName, documents, options = {}) {
    console.log(`Inserting ${documents.length} time series documents into ${collectionName}...`);

    try {
      const collectionInfo = this.collections.get(collectionName);
      if (!collectionInfo) {
        throw new Error(`Time series collection '${collectionName}' not found`);
      }

      const collection = collectionInfo.collection;

      // Prepare documents for time series insertion
      const preparedDocuments = documents.map(doc => this.prepareTimeSeriesDocument(doc, options));

      // Execute optimized bulk insertion
      const insertOptions = {
        ordered: options.ordered !== undefined ? options.ordered : false,
        writeConcern: options.writeConcern || { w: 'majority', j: true },
        ...options.insertOptions
      };

      const insertResult = await collection.insertMany(preparedDocuments, insertOptions);

      // Update data quality metrics if enabled
      if (this.config.enableDataQualityTracking) {
        await this.updateDataQualityMetrics(collectionName, preparedDocuments);
      }

      // Trigger anomaly detection if enabled
      if (this.config.enableAnomalyDetection) {
        await this.checkForAnomalies(collectionName, preparedDocuments);
      }

      return {
        success: true,
        collectionName: collectionName,
        documentsInserted: insertResult.insertedCount,
        insertedIds: insertResult.insertedIds,

        // Performance metrics
        averageDocumentSize: this.calculateAverageDocumentSize(preparedDocuments),
        compressionEnabled: Boolean(collectionInfo.config.storageEngine),

        // Data quality summary
        dataQualityScore: options.trackQuality ? this.calculateDataQualityScore(preparedDocuments) : null
      };

    } catch (error) {
      console.error(`Error inserting time series data into '${collectionName}':`, error);
      return {
        success: false,
        error: error.message,
        collectionName: collectionName
      };
    }
  }

  prepareTimeSeriesDocument(document, options = {}) {
    // Ensure proper time series document structure
    const prepared = {
      timestamp: document.timestamp || new Date(),

      // Organize metadata for optimal bucketing
      metadata: {
        device_id: document.device_id || document.metadata?.device_id,
        sensor_type: document.sensor_type || document.metadata?.sensor_type,
        location: document.location || document.metadata?.location,

        // Device-specific metadata
        device_model: document.device_model || document.metadata?.device_model,
        firmware_version: document.firmware_version || document.metadata?.firmware_version,

        // Data quality indicators
        data_quality: options.calculateQuality ? this.assessDataQuality(document) : undefined,

        // Additional metadata preservation
        ...document.metadata
      },

      // Measurements with proper data types
      measurements: {
        // Environmental measurements
        temperature: this.validateMeasurement(document.temperature, 'temperature'),
        humidity: this.validateMeasurement(document.humidity, 'humidity'),
        pressure: this.validateMeasurement(document.pressure, 'pressure'),

        // Device status measurements
        battery_level: this.validateMeasurement(document.battery_level, 'battery'),
        signal_strength: this.validateMeasurement(document.signal_strength, 'signal'),

        // Custom measurements
        ...this.extractCustomMeasurements(document)
      }
    };

    // Remove undefined values to optimize storage
    this.removeUndefinedValues(prepared);

    return prepared;
  }
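
  // Design note: the metaField document above should contain only values that
  // identify a series and change rarely (device id, sensor type, location,
  // hardware model). MongoDB buckets measurements per unique metaField value,
  // so placing frequently-changing or high-cardinality values there fragments
  // buckets and hurts both compression and query performance.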

  validateMeasurement(value, measurementType) {
    if (value === null || value === undefined) return undefined;

    // Type-specific validation and normalization
    const validationRules = {
      temperature: { min: -50, max: 100, precision: 2 },
      humidity: { min: 0, max: 100, precision: 1 },
      pressure: { min: 900, max: 1100, precision: 1 },
      battery: { min: 0, max: 100, precision: 0 },
      signal: { min: 0, max: 100, precision: 0 }
    };

    const rule = validationRules[measurementType];
    if (!rule) return value; // No validation rule, return as-is

    const numericValue = Number(value);
    if (isNaN(numericValue)) return undefined;

    // Apply bounds checking
    const boundedValue = Math.max(rule.min, Math.min(rule.max, numericValue));

    // Apply precision rounding
    return Number(boundedValue.toFixed(rule.precision));
  }

  async performTimeSeriesAggregation(collectionName, aggregationRequest) {
    console.log(`Performing time series aggregation on ${collectionName}...`);

    try {
      const collectionInfo = this.collections.get(collectionName);
      if (!collectionInfo) {
        throw new Error(`Time series collection '${collectionName}' not found`);
      }

      const collection = collectionInfo.collection;

      // Build optimized aggregation pipeline
      const aggregationPipeline = this.buildTimeSeriesAggregationPipeline(aggregationRequest);

      // Execute aggregation with appropriate options
      const aggregationOptions = {
        allowDiskUse: true,
        maxTimeMS: aggregationRequest.maxTimeMS || 60000,
        hint: aggregationRequest.hint,
        comment: `time_series_aggregation_${Date.now()}`
      };

      const results = await collection.aggregate(aggregationPipeline, aggregationOptions).toArray();

      // Post-process results for enhanced analytics
      const processedResults = this.processAggregationResults(results, aggregationRequest);

      return {
        success: true,
        collectionName: collectionName,
        aggregationType: aggregationRequest.type,
        resultCount: results.length,

        // Aggregation results
        results: processedResults,

        // Execution metadata
        executionStats: {
          pipelineStages: aggregationPipeline.length,
          executionTime: Date.now(),
          dataPointsAnalyzed: this.estimateDataPointsAnalyzed(aggregationRequest)
        }
      };

    } catch (error) {
      console.error(`Error performing time series aggregation on '${collectionName}':`, error);
      return {
        success: false,
        error: error.message,
        collectionName: collectionName,
        aggregationType: aggregationRequest.type
      };
    }
  }

  buildTimeSeriesAggregationPipeline(request) {
    const pipeline = [];

    // Time range filtering (essential first stage for performance)
    if (request.timeRange) {
      pipeline.push({
        $match: {
          timestamp: {
            $gte: new Date(request.timeRange.start),
            $lte: new Date(request.timeRange.end)
          }
        }
      });
    }

    // Metadata filtering
    if (request.filters) {
      const matchConditions = {};

      if (request.filters.device_ids) {
        matchConditions['metadata.device_id'] = { $in: request.filters.device_ids };
      }

      if (request.filters.sensor_types) {
        matchConditions['metadata.sensor_type'] = { $in: request.filters.sensor_types };
      }

      if (request.filters.locations) {
        matchConditions['metadata.location'] = { $in: request.filters.locations };
      }

      if (Object.keys(matchConditions).length > 0) {
        pipeline.push({ $match: matchConditions });
      }
    }

    // Time-based grouping and aggregation
    switch (request.type) {
      case 'time_bucket_aggregation':
        pipeline.push(...this.buildTimeBucketAggregation(request));
        break;
      case 'device_summary':
        pipeline.push(...this.buildDeviceSummaryAggregation(request));
        break;
      case 'trend_analysis':
        pipeline.push(...this.buildTrendAnalysisAggregation(request));
        break;
      case 'anomaly_detection':
        pipeline.push(...this.buildAnomalyDetectionAggregation(request));
        break;
      default:
        pipeline.push(...this.buildDefaultAggregation(request));
    }

    // Result limiting and sorting
    if (request.sort) {
      pipeline.push({ $sort: request.sort });
    }

    if (request.limit) {
      pipeline.push({ $limit: request.limit });
    }

    return pipeline;
  }

  buildTimeBucketAggregation(request) {
    const bucketSize = request.bucketSize || 'hour';
    const bucketFormat = this.getBucketDateFormat(bucketSize);
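
    // Note: on MongoDB 5.0+ the $dateTrunc expression is usually a simpler and
    // faster way to bucket timestamps than the $dateToString/$dateFromString
    // round-trip used below; the string-based approach is kept here for
    // compatibility with older servers.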

    return [
      {
        $group: {
          _id: {
            time_bucket: {
              $dateFromString: {
                dateString: {
                  $dateToString: {
                    date: '$timestamp',
                    format: bucketFormat
                  }
                }
              }
            },
            device_id: '$metadata.device_id',
            sensor_type: '$metadata.sensor_type'
          },

          // Statistical aggregations
          measurement_count: { $sum: 1 },

          // Temperature statistics
          avg_temperature: { $avg: '$measurements.temperature' },
          min_temperature: { $min: '$measurements.temperature' },
          max_temperature: { $max: '$measurements.temperature' },
          temp_variance: { $stdDevPop: '$measurements.temperature' },

          // Humidity statistics
          avg_humidity: { $avg: '$measurements.humidity' },
          min_humidity: { $min: '$measurements.humidity' },
          max_humidity: { $max: '$measurements.humidity' },

          // Pressure statistics
          avg_pressure: { $avg: '$measurements.pressure' },
          pressure_range: {
            $subtract: [
              { $max: '$measurements.pressure' },
              { $min: '$measurements.pressure' }
            ]
          },

          // Device health metrics
          avg_battery_level: { $avg: '$measurements.battery_level' },
          min_battery_level: { $min: '$measurements.battery_level' },
          avg_signal_strength: { $avg: '$measurements.signal_strength' },

          // Data quality metrics
          data_completeness: {
            $avg: {
              $cond: {
                if: {
                  $and: [
                    { $ne: ['$measurements.temperature', null] },
                    { $ne: ['$measurements.humidity', null] },
                    { $ne: ['$measurements.pressure', null] }
                  ]
                },
                then: 1,
                else: 0
              }
            }
          },

          // Time range within bucket
          earliest_reading: { $min: '$timestamp' },
          latest_reading: { $max: '$timestamp' }
        }
      },

      // Post-processing and enrichment
      {
        $addFields: {
          time_bucket: '$_id.time_bucket',
          device_id: '$_id.device_id',
          sensor_type: '$_id.sensor_type',

          // Calculate additional metrics
          temperature_stability: {
            $cond: {
              if: { $gt: ['$temp_variance', 0] },
              then: { $divide: ['$temp_variance', '$avg_temperature'] },
              else: 0
            }
          },

          // Battery consumption rate (simplified)
          estimated_battery_consumption: {
            $subtract: [100, '$avg_battery_level']
          },

          // Data quality score
          data_quality_score: {
            $multiply: ['$data_completeness', 100]
          },

          // Bucket duration in minutes
          bucket_duration_minutes: {
            $divide: [
              { $subtract: ['$latest_reading', '$earliest_reading'] },
              60000
            ]
          }
        }
      },

      // Remove the grouped _id field
      {
        $project: { _id: 0 }
      }
    ];
  }

  buildDeviceSummaryAggregation(request) {
    return [
      {
        $group: {
          _id: '$metadata.device_id',

          // Basic metrics
          total_readings: { $sum: 1 },
          sensor_types: { $addToSet: '$metadata.sensor_type' },
          locations: { $addToSet: '$metadata.location' },

          // Time range
          first_reading: { $min: '$timestamp' },
          last_reading: { $max: '$timestamp' },

          // Environmental averages
          avg_temperature: { $avg: '$measurements.temperature' },
          avg_humidity: { $avg: '$measurements.humidity' },
          avg_pressure: { $avg: '$measurements.pressure' },

          // Environmental ranges
          temperature_range: {
            $subtract: [
              { $max: '$measurements.temperature' },
              { $min: '$measurements.temperature' }
            ]
          },
          humidity_range: {
            $subtract: [
              { $max: '$measurements.humidity' },
              { $min: '$measurements.humidity' }
            ]
          },

          // Device health metrics
          current_battery_level: { $last: '$measurements.battery_level' },
          min_battery_level: { $min: '$measurements.battery_level' },
          avg_signal_strength: { $avg: '$measurements.signal_strength' },
          min_signal_strength: { $min: '$measurements.signal_strength' },

          // Data quality assessment
          complete_readings: {
            $sum: {
              $cond: {
                if: {
                  $and: [
                    { $ne: ['$measurements.temperature', null] },
                    { $ne: ['$measurements.humidity', null] },
                    { $ne: ['$measurements.pressure', null] }
                  ]
                },
                then: 1,
                else: 0
              }
            }
          }
        }
      },

      {
        $addFields: {
          device_id: '$_id',

          // Operational duration
          operational_duration_hours: {
            $divide: [
              { $subtract: ['$last_reading', '$first_reading'] },
              3600000
            ]
          },

          // Reading frequency
          avg_reading_interval_minutes: {
            $cond: {
              if: { $gt: ['$total_readings', 1] },
              then: {
                $divide: [
                  { $subtract: ['$last_reading', '$first_reading'] },
                  { $multiply: [{ $subtract: ['$total_readings', 1] }, 60000] }
                ]
              },
              else: null
            }
          },

          // Data completeness percentage
          data_completeness_percent: {
            $multiply: [
              { $divide: ['$complete_readings', '$total_readings'] },
              100
            ]
          },

          // Device health status
          device_health_status: {
            $switch: {
              branches: [
                {
                  case: { $lt: ['$current_battery_level', 15] },
                  then: 'critical_battery'
                },
                {
                  case: { $lt: ['$avg_signal_strength', 30] },
                  then: 'poor_connectivity'
                },
                {
                  case: {
                    $lt: [
                      { $divide: ['$complete_readings', '$total_readings'] },
                      0.8
                    ]
                  },
                  then: 'data_quality_issues'
                }
              ],
              default: 'healthy'
            }
          }
        }
      },

      {
        $project: { _id: 0 }
      }
    ];
  }

  getBucketDateFormat(bucketSize) {
    const formats = {
      'minute': '%Y-%m-%d %H:%M:00',
      'hour': '%Y-%m-%d %H:00:00',
      'day': '%Y-%m-%d 00:00:00',
      'week': '%Y-%U 00:00:00', // Year-week (not parseable back by $dateFromString's default parser; prefer $dateTrunc with unit 'week' for weekly buckets)
      'month': '%Y-%m-01 00:00:00'
    };

    return formats[bucketSize] || formats['hour'];
  }

  async setupRetentionPolicies() {
    console.log('Setting up automatic data retention policies...');

    try {
      for (const [collectionName, collectionInfo] of this.collections.entries()) {
        // Time series collections do not support TTL indexes; document expiration
        // is controlled by the collection-level expireAfterSeconds option, which
        // can be adjusted after creation with the collMod command
        const expireAfterSeconds = collectionInfo.config.expireAfterSeconds;

        await this.db.command({
          collMod: collectionName,
          expireAfterSeconds: expireAfterSeconds
        });

        console.log(`Retention policy configured for ${collectionName}: ${expireAfterSeconds} seconds`);
      }

    } catch (error) {
      console.error('Error setting up retention policies:', error);
      throw error;
    }
  }

  async setupPreAggregationPipelines() {
    console.log('Setting up pre-aggregation pipelines...');

    // This would typically involve setting up MongoDB change streams
    // or scheduled aggregation jobs for common query patterns

    for (const level of this.config.aggregationLevels) {
      const pipelineName = `pre_aggregation_${level}`;

      // Store pipeline configuration for later execution
      this.aggregationPipelines.set(pipelineName, {
        level: level,
        schedule: this.getAggregationSchedule(level),
        pipeline: this.buildPreAggregationPipeline(level)
      });

      console.log(`Pre-aggregation pipeline configured for ${level} level`);
    }
  }

  // Utility methods for time series management

  calculateAverageDocumentSize(documents) {
    if (!documents || documents.length === 0) return 0;

    const totalSize = documents.reduce((size, doc) => {
      return size + JSON.stringify(doc).length;
    }, 0);

    return Math.round(totalSize / documents.length);
  }

  assessDataQuality(document) {
    let qualityScore = 0;
    let totalChecks = 0;

    // Check for presence of key measurements
    const measurements = ['temperature', 'humidity', 'pressure'];
    for (const measurement of measurements) {
      totalChecks++;
      if (document[measurement] !== null && document[measurement] !== undefined) {
        qualityScore++;
      }
    }

    // Check for reasonable value ranges
    if (document.temperature !== null && document.temperature >= -50 && document.temperature <= 100) {
      qualityScore += 0.5;
    }
    totalChecks += 0.5;

    if (document.humidity !== null && document.humidity >= 0 && document.humidity <= 100) {
      qualityScore += 0.5;
    }
    totalChecks += 0.5;

    return totalChecks > 0 ? qualityScore / totalChecks : 0;
  }

  extractCustomMeasurements(document) {
    const customMeasurements = {};
    const standardFields = ['timestamp', 'device_id', 'sensor_type', 'location', 'metadata', 'temperature', 'humidity', 'pressure', 'battery_level', 'signal_strength'];

    for (const [key, value] of Object.entries(document)) {
      if (!standardFields.includes(key) && typeof value === 'number') {
        customMeasurements[key] = value;
      }
    }

    return customMeasurements;
  }

  removeUndefinedValues(obj) {
    Object.keys(obj).forEach(key => {
      if (obj[key] === undefined) {
        delete obj[key];
      } else if (typeof obj[key] === 'object' && obj[key] !== null) {
        this.removeUndefinedValues(obj[key]);

        // Remove empty objects
        if (Object.keys(obj[key]).length === 0) {
          delete obj[key];
        }
      }
    });
  }

  processAggregationResults(results, request) {
    // Add additional context and calculations to aggregation results
    return results.map(result => ({
      ...result,

      // Add computed fields based on aggregation type
      aggregation_metadata: {
        request_type: request.type,
        generated_at: new Date(),
        bucket_size: request.bucketSize,
        time_range: request.timeRange
      }
    }));
  }

  estimateDataPointsAnalyzed(request) {
    // Simplified estimation based on time range and expected frequency
    if (!request.timeRange) return 'unknown';

    const timeRangeMs = new Date(request.timeRange.end) - new Date(request.timeRange.start);
    const assumedFrequencyMs = 60000; // Assume 1 minute intervals

    return Math.round(timeRangeMs / assumedFrequencyMs);
  }

  getAggregationSchedule(level) {
    // 6-field cron expressions: seconds minutes hours day-of-month month day-of-week
    const schedules = {
      'hourly': '0 0 * * * *',       // Top of every hour
      'daily': '0 0 0 * * *',        // Every day at midnight
      'weekly': '0 0 0 * * 0',       // Every Sunday at midnight
      'monthly': '0 0 0 1 * *'       // First day of every month at midnight
    };

    return schedules[level] || schedules['daily'];
  }

  buildPreAggregationPipeline(level) {
    // Simplified pre-aggregation pipeline
    // In production, this would be much more sophisticated
    return [
      {
        $match: {
          timestamp: {
            $gte: new Date(Date.now() - this.getLevelTimeRange(level))
          }
        }
      },
      {
        $group: {
          _id: {
            device_id: '$metadata.device_id',
            time_bucket: this.getTimeBucketExpression(level)
          },
          avg_temperature: { $avg: '$measurements.temperature' },
          avg_humidity: { $avg: '$measurements.humidity' },
          count: { $sum: 1 }
        }
      }
    ];
  }

  getLevelTimeRange(level) {
    const ranges = {
      'hourly': 24 * 60 * 60 * 1000,      // 1 day
      'daily': 30 * 24 * 60 * 60 * 1000,  // 30 days
      'weekly': 12 * 7 * 24 * 60 * 60 * 1000, // 12 weeks
      'monthly': 12 * 30 * 24 * 60 * 60 * 1000 // 12 months
    };

    return ranges[level] || ranges['daily'];
  }

  getTimeBucketExpression(level) {
    const expressions = {
      'hourly': {
        $dateFromString: {
          dateString: {
            $dateToString: {
              date: '$timestamp',
              format: '%Y-%m-%d %H:00:00'
            }
          }
        }
      },
      'daily': {
        $dateFromString: {
          dateString: {
            $dateToString: {
              date: '$timestamp',
              format: '%Y-%m-%d 00:00:00'
            }
          }
        }
      }
    };

    return expressions[level] || expressions['hourly'];
  }
}

// Benefits of MongoDB Advanced Time Series Collections:
// - Purpose-built storage optimization with automatic compression
// - Intelligent bucketing for optimal query performance  
// - Built-in retention policies and automatic data expiration
// - Advanced indexing strategies optimized for temporal queries
// - Schema flexibility for diverse sensor and measurement data
// - Native aggregation capabilities for time series analytics
// - Automatic storage optimization and compression
// - High ingestion performance for IoT and monitoring workloads
// - Built-in support for metadata organization and filtering
// - SQL-compatible time series operations through QueryLeaf integration

module.exports = {
  AdvancedTimeSeriesManager
};
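
The manager above can be exercised end to end with a short driver script. Everything below is illustrative: the connection string, database and collection names, the sample reading, and the './advanced-time-series-manager' module path are assumptions rather than part of any published API:

// Example usage of AdvancedTimeSeriesManager (all names and values illustrative)
const { MongoClient } = require('mongodb');
const { AdvancedTimeSeriesManager } = require('./advanced-time-series-manager');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const db = client.db('advanced_time_series');

  const manager = new AdvancedTimeSeriesManager(db, { defaultGranularity: 'seconds' });

  // Create the collection, load one sample reading, then run a device summary
  await manager.createTimeSeriesCollection('sensor_data', {
    timeField: 'timestamp',
    metaField: 'metadata',
    expireAfterSeconds: 60 * 60 * 24 * 30 // keep raw readings for 30 days
  });

  await manager.insertTimeSeriesData('sensor_data', [{
    device_id: 'device_001',
    sensor_type: 'environmental',
    location: 'warehouse_a',
    timestamp: new Date(),
    temperature: 23.5,
    humidity: 45.2,
    pressure: 1013.2,
    battery_level: 85,
    signal_strength: 75
  }]);

  const summary = await manager.performTimeSeriesAggregation('sensor_data', {
    type: 'device_summary',
    timeRange: { start: new Date(Date.now() - 24 * 3600 * 1000), end: new Date() }
  });
  console.log(summary.results);

  await client.close();
}

main().catch(console.error);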

Understanding MongoDB Time Series Architecture

Advanced Temporal Data Management and Storage Optimization Strategies

Implement sophisticated time series patterns for production MongoDB deployments:

// Production-ready MongoDB time series with enterprise-grade optimization and monitoring
class ProductionTimeSeriesManager extends AdvancedTimeSeriesManager {
  constructor(db, productionConfig) {
    super(db, productionConfig);

    this.productionConfig = {
      ...productionConfig,
      enableDistributedCollection: true,
      enableRealTimeAggregation: true,
      enablePredictiveAnalytics: true,
      enableAutomaticScaling: true,
      enableComplianceTracking: true,
      enableAdvancedAlerting: true
    };

    this.setupProductionOptimizations();
    this.initializeRealTimeProcessing();
    this.setupPredictiveAnalytics();
  }

  async implementDistributedTimeSeriesProcessing(collections, distributionStrategy) {
    console.log('Implementing distributed time series processing across multiple collections...');

    const distributedStrategy = {
      // Temporal sharding strategies
      temporalSharding: {
        enableTimeBasedSharding: true,
        shardingGranularity: 'monthly',
        automaticShardRotation: true,
        optimizeForQueryPatterns: true
      },

      // Data lifecycle management
      lifecycleManagement: {
        hotDataRetention: '7d',
        warmDataRetention: '90d', 
        coldDataArchival: '1y',
        automaticTiering: true
      },

      // Performance optimization
      performanceOptimization: {
        compressionOptimization: true,
        indexingOptimization: true,
        bucketingOptimization: true,
        aggregationOptimization: true
      }
    };

    return await this.deployDistributedTimeSeriesArchitecture(collections, distributedStrategy);
  }

  async setupAdvancedTimeSeriesAnalytics() {
    console.log('Setting up advanced time series analytics and machine learning capabilities...');

    const analyticsCapabilities = {
      // Real-time analytics
      realTimeAnalytics: {
        streamingAggregation: true,
        anomalyDetection: true,
        trendAnalysis: true,
        alertingPipelines: true
      },

      // Predictive analytics
      predictiveAnalytics: {
        forecastingModels: true,
        patternRecognition: true,
        seasonalityDetection: true,
        capacityPlanning: true
      },

      // Advanced reporting
      reportingCapabilities: {
        automaticDashboards: true,
        customMetrics: true,
        correlationAnalysis: true,
        performanceReporting: true
      }
    };

    return await this.deployAdvancedAnalytics(analyticsCapabilities);
  }
}

SQL-Style Time Series Operations with QueryLeaf

QueryLeaf provides familiar SQL syntax for MongoDB time series operations and analytics:

-- QueryLeaf advanced time series operations with SQL-familiar syntax for MongoDB

-- Create optimized time series collection with advanced configuration
CREATE COLLECTION sensor_data AS TIME_SERIES (
  time_field = 'timestamp',
  meta_field = 'metadata',
  granularity = 'seconds',

  -- Storage optimization
  bucket_max_span_seconds = 3600,
  bucket_rounding_seconds = 60,
  expire_after_seconds = 2592000,  -- 30 days

  -- Compression settings
  enable_compression = true,
  compression_algorithm = 'zstd',

  -- Performance optimization
  enable_automatic_indexing = true,
  optimize_for_ingestion = true,
  optimize_for_analytics = true
);
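
-- For reference, a QueryLeaf definition like the one above would map roughly to the
-- following native driver call (a sketch only; the exact translation depends on the
-- QueryLeaf release). Note that in MongoDB itself granularity and the custom bucketing
-- parameters are mutually exclusive, and bucketMaxSpanSeconds / bucketRoundingSeconds
-- must be set to the same value when they are used:
--
--   await db.createCollection('sensor_data', {
--     timeseries: {
--       timeField: 'timestamp',
--       metaField: 'metadata',
--       granularity: 'seconds'
--     },
--     expireAfterSeconds: 2592000 // 30 days
--   });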

-- Advanced time series data insertion with automatic optimization
INSERT INTO sensor_data (
  timestamp,
  metadata.device_id,
  metadata.sensor_type,
  metadata.location,
  metadata.data_quality,
  measurements.temperature,
  measurements.humidity,
  measurements.pressure,
  measurements.battery_level,
  measurements.signal_strength
)
SELECT 
  -- Time series specific timestamp handling
  CASE 
    WHEN source_timestamp IS NOT NULL THEN source_timestamp
    ELSE CURRENT_TIMESTAMP
  END as timestamp,

  -- Metadata organization for optimal bucketing
  device_identifier as "metadata.device_id",
  sensor_classification as "metadata.sensor_type", 
  installation_location as "metadata.location",

  -- Data quality assessment
  CASE 
    WHEN temp_reading IS NOT NULL AND humidity_reading IS NOT NULL AND pressure_reading IS NOT NULL THEN 'complete'
    WHEN temp_reading IS NOT NULL OR humidity_reading IS NOT NULL THEN 'partial'
    ELSE 'incomplete'
  END as "metadata.data_quality",

  -- Validated measurements
  CASE 
    WHEN temp_reading BETWEEN -50 AND 100 THEN ROUND(temp_reading, 2)
    ELSE NULL
  END as "measurements.temperature",

  CASE 
    WHEN humidity_reading BETWEEN 0 AND 100 THEN ROUND(humidity_reading, 1)
    ELSE NULL  
  END as "measurements.humidity",

  CASE 
    WHEN pressure_reading BETWEEN 900 AND 1100 THEN ROUND(pressure_reading, 1)
    ELSE NULL
  END as "measurements.pressure",

  -- Device health measurements
  GREATEST(0, LEAST(100, battery_percentage)) as "measurements.battery_level",
  GREATEST(0, LEAST(100, connectivity_strength)) as "measurements.signal_strength"

FROM staging_sensor_readings
WHERE ingestion_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 hour'
  AND device_identifier IS NOT NULL
  AND source_timestamp IS NOT NULL

-- Time series bulk insert configuration
WITH (
  batch_size = 5000,
  ordered_operations = false,
  write_concern = 'majority',
  enable_compression = true,
  bypass_document_validation = false
);

-- Advanced time-bucket aggregation with comprehensive analytics
WITH time_bucket_analysis AS (
  SELECT 
    -- Time bucketing with flexible granularity
    DATE_TRUNC('hour', timestamp) as time_bucket,
    metadata.device_id,
    metadata.sensor_type,
    metadata.location,

    -- Volume metrics
    COUNT(*) as reading_count,
    COUNT(measurements.temperature) as temp_reading_count,
    COUNT(measurements.humidity) as humidity_reading_count,
    COUNT(measurements.pressure) as pressure_reading_count,

    -- Temperature analytics
    AVG(measurements.temperature) as avg_temperature,
    MIN(measurements.temperature) as min_temperature,
    MAX(measurements.temperature) as max_temperature,
    STDDEV_POP(measurements.temperature) as temp_stddev,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY measurements.temperature) as temp_median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY measurements.temperature) as temp_p95,

    -- Humidity analytics
    AVG(measurements.humidity) as avg_humidity,
    MIN(measurements.humidity) as min_humidity,
    MAX(measurements.humidity) as max_humidity,
    STDDEV_POP(measurements.humidity) as humidity_stddev,

    -- Pressure analytics  
    AVG(measurements.pressure) as avg_pressure,
    MIN(measurements.pressure) as min_pressure,
    MAX(measurements.pressure) as max_pressure,
    (MAX(measurements.pressure) - MIN(measurements.pressure)) as pressure_range,

    -- Device health analytics
    AVG(measurements.battery_level) as avg_battery,
    MIN(measurements.battery_level) as min_battery,
    AVG(measurements.signal_strength) as avg_signal,
    MIN(measurements.signal_strength) as min_signal,

    -- Data quality analytics
    (COUNT(measurements.temperature) * 100.0 / COUNT(*)) as temp_completeness_percent,
    (COUNT(measurements.humidity) * 100.0 / COUNT(*)) as humidity_completeness_percent,
    (COUNT(measurements.pressure) * 100.0 / COUNT(*)) as pressure_completeness_percent,

    -- Time range within bucket
    MIN(timestamp) as bucket_start_time,
    MAX(timestamp) as bucket_end_time,

    -- Advanced statistical measures
    (MAX(measurements.temperature) - MIN(measurements.temperature)) as temp_range,
    CASE 
      WHEN AVG(measurements.temperature) > 0 THEN 
        STDDEV_POP(measurements.temperature) / AVG(measurements.temperature) 
      ELSE NULL
    END as temp_coefficient_variation

  FROM sensor_data
  WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    AND timestamp < CURRENT_TIMESTAMP
    AND metadata.data_quality IN ('complete', 'partial')
  GROUP BY 
    DATE_TRUNC('hour', timestamp),
    metadata.device_id,
    metadata.sensor_type,
    metadata.location
),

anomaly_detection AS (
  SELECT 
    tba.*,

    -- Temperature anomaly detection
    CASE 
      WHEN temp_stddev > 0 THEN
        ABS(avg_temperature - LAG(avg_temperature) OVER (
          PARTITION BY device_id 
          ORDER BY time_bucket
        )) / temp_stddev
      ELSE 0
    END as temp_anomaly_score,

    -- Humidity anomaly detection  
    CASE 
      WHEN humidity_stddev > 0 THEN
        ABS(avg_humidity - LAG(avg_humidity) OVER (
          PARTITION BY device_id 
          ORDER BY time_bucket
        )) / humidity_stddev
      ELSE 0
    END as humidity_anomaly_score,

    -- Battery degradation analysis
    LAG(avg_battery) OVER (
      PARTITION BY device_id 
      ORDER BY time_bucket
    ) - avg_battery as battery_degradation,

    -- Signal strength trend
    avg_signal - LAG(avg_signal) OVER (
      PARTITION BY device_id 
      ORDER BY time_bucket
    ) as signal_trend,

    -- Data quality trend
    (temp_completeness_percent + humidity_completeness_percent + pressure_completeness_percent) / 3.0 as overall_completeness,

    -- Bucket characteristics
    EXTRACT(EPOCH FROM (bucket_end_time - bucket_start_time)) / 60.0 as bucket_duration_minutes

  FROM time_bucket_analysis tba
),

device_health_assessment AS (
  SELECT 
    ad.device_id,
    ad.sensor_type,
    ad.location,
    COUNT(*) as analysis_periods,

    -- Environmental stability analysis
    AVG(ad.avg_temperature) as device_avg_temperature,
    STDDEV(ad.avg_temperature) as temperature_stability,
    AVG(ad.temp_coefficient_variation) as avg_temp_variability,

    -- Environmental range analysis
    MIN(ad.min_temperature) as absolute_min_temperature,
    MAX(ad.max_temperature) as absolute_max_temperature,
    AVG(ad.temp_range) as avg_hourly_temp_range,

    -- Humidity environment analysis
    AVG(ad.avg_humidity) as device_avg_humidity,
    STDDEV(ad.avg_humidity) as humidity_stability,
    AVG(ad.pressure_range) as avg_pressure_variation,

    -- Device health metrics
    MIN(ad.min_battery) as lowest_battery_level,
    AVG(ad.avg_battery) as avg_battery_level,
    MAX(ad.battery_degradation) as max_battery_drop_per_hour,

    -- Connectivity analysis
    AVG(ad.avg_signal) as avg_connectivity,
    MIN(ad.min_signal) as worst_connectivity,
    STDDEV(ad.avg_signal) as connectivity_stability,

    -- Data reliability metrics
    AVG(ad.overall_completeness) as avg_data_completeness,
    MIN(ad.overall_completeness) as worst_data_completeness,

    -- Anomaly frequency
    COUNT(*) FILTER (WHERE ad.temp_anomaly_score > 2) as temp_anomaly_count,
    COUNT(*) FILTER (WHERE ad.humidity_anomaly_score > 2) as humidity_anomaly_count,
    AVG(ad.temp_anomaly_score) as avg_temp_anomaly_score,

    -- Recent trends (last 6 hours vs previous)
    AVG(CASE WHEN ad.time_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours' 
             THEN ad.avg_battery ELSE NULL END) - 
    AVG(CASE WHEN ad.time_bucket < CURRENT_TIMESTAMP - INTERVAL '6 hours' 
             THEN ad.avg_battery ELSE NULL END) as recent_battery_trend,

    AVG(CASE WHEN ad.time_bucket >= CURRENT_TIMESTAMP - INTERVAL '6 hours' 
             THEN ad.avg_signal ELSE NULL END) - 
    AVG(CASE WHEN ad.time_bucket < CURRENT_TIMESTAMP - INTERVAL '6 hours' 
             THEN ad.avg_signal ELSE NULL END) as recent_signal_trend

  FROM anomaly_detection ad
  GROUP BY ad.device_id, ad.sensor_type, ad.location
)

SELECT 
  dha.device_id,
  dha.sensor_type,
  dha.location,
  dha.analysis_periods,

  -- Environmental summary
  ROUND(dha.device_avg_temperature, 2) as avg_temperature,
  ROUND(dha.temperature_stability, 3) as temp_stability_stddev,
  ROUND(dha.avg_temp_variability, 3) as avg_temp_coefficient_variation,
  dha.absolute_min_temperature,
  dha.absolute_max_temperature,

  -- Environmental classification
  CASE 
    WHEN dha.temperature_stability > 5 THEN 'highly_variable'
    WHEN dha.temperature_stability > 2 THEN 'moderately_variable'  
    WHEN dha.temperature_stability > 1 THEN 'stable'
    ELSE 'very_stable'
  END as temperature_environment_classification,

  -- Device health summary
  ROUND(dha.avg_battery_level, 1) as avg_battery_level,
  dha.lowest_battery_level,
  ROUND(dha.max_battery_drop_per_hour, 2) as max_hourly_battery_consumption,
  ROUND(dha.avg_connectivity, 1) as avg_signal_strength,

  -- Device status assessment
  CASE 
    WHEN dha.lowest_battery_level < 15 THEN 'critical_battery'
    WHEN dha.avg_battery_level < 25 THEN 'low_battery'
    WHEN dha.avg_connectivity < 30 THEN 'connectivity_issues'
    WHEN dha.avg_data_completeness < 80 THEN 'data_quality_issues'
    WHEN dha.temp_anomaly_count > dha.analysis_periods * 0.2 THEN 'environmental_anomalies'
    ELSE 'healthy'
  END as device_status,

  -- Data quality assessment
  ROUND(dha.avg_data_completeness, 1) as avg_data_completeness_percent,
  dha.worst_data_completeness,

  -- Anomaly summary
  dha.temp_anomaly_count,
  dha.humidity_anomaly_count,
  ROUND(dha.avg_temp_anomaly_score, 3) as avg_temp_anomaly_score,

  -- Recent trends
  ROUND(dha.recent_battery_trend, 2) as recent_battery_change,
  ROUND(dha.recent_signal_trend, 1) as recent_signal_change,

  -- Trend classification
  CASE 
    WHEN dha.recent_battery_trend < -2 THEN 'battery_degrading_fast'
    WHEN dha.recent_battery_trend < -0.5 THEN 'battery_degrading'
    WHEN dha.recent_battery_trend > 1 THEN 'battery_improving'  -- Could indicate replacement
    ELSE 'battery_stable'
  END as battery_trend_classification,

  CASE 
    WHEN dha.recent_signal_trend < -5 THEN 'connectivity_degrading'
    WHEN dha.recent_signal_trend > 5 THEN 'connectivity_improving'
    ELSE 'connectivity_stable'
  END as connectivity_trend_classification,

  -- Alert generation (non-matching cases yield NULLs, which are stripped from the array)
  ARRAY_REMOVE(ARRAY[
    CASE WHEN dha.lowest_battery_level < 10 THEN 'CRITICAL: Battery critically low' END,
    CASE WHEN dha.avg_connectivity < 25 THEN 'WARNING: Poor connectivity detected' END,
    CASE WHEN dha.avg_data_completeness < 70 THEN 'WARNING: Low data quality' END,
    CASE WHEN dha.recent_battery_trend < -3 THEN 'ALERT: Rapid battery degradation' END,
    CASE WHEN dha.temp_anomaly_count > dha.analysis_periods * 0.3 THEN 'ALERT: Frequent temperature anomalies' END
  ]::TEXT[], NULL) as active_alerts,

  -- Recommendations
  CASE 
    WHEN dha.lowest_battery_level < 15 THEN 'Schedule immediate battery replacement'
    WHEN dha.avg_connectivity < 30 THEN 'Check network coverage and device positioning'  
    WHEN dha.avg_data_completeness < 80 THEN 'Inspect sensors and perform calibration'
    WHEN dha.temp_anomaly_count > dha.analysis_periods * 0.2 THEN 'Investigate environmental factors'
    ELSE 'Device operating within normal parameters'
  END as maintenance_recommendation

FROM device_health_assessment dha
ORDER BY 
  CASE 
    WHEN dha.lowest_battery_level < 15 THEN 1
    WHEN dha.avg_connectivity < 30 THEN 2
    WHEN dha.avg_data_completeness < 80 THEN 3
    ELSE 4
  END,
  dha.device_id;

-- Advanced time series trend analysis with seasonality detection
WITH daily_aggregates AS (
  SELECT 
    DATE_TRUNC('day', timestamp) as date_bucket,
    metadata.location,
    metadata.sensor_type,

    -- Daily environmental summaries
    AVG(measurements.temperature) as daily_avg_temp,
    MIN(measurements.temperature) as daily_min_temp,
    MAX(measurements.temperature) as daily_max_temp,
    AVG(measurements.humidity) as daily_avg_humidity,
    AVG(measurements.pressure) as daily_avg_pressure,

    -- Data volume and quality
    COUNT(*) as daily_reading_count,
    (COUNT(measurements.temperature) * 100.0 / COUNT(*)) as daily_completeness

  FROM sensor_data
  WHERE timestamp >= CURRENT_DATE - INTERVAL '90 days'
    AND timestamp < CURRENT_DATE
    AND metadata.location IS NOT NULL
  GROUP BY DATE_TRUNC('day', timestamp), metadata.location, metadata.sensor_type
),

weekly_patterns AS (
  SELECT 
    da.*,
    EXTRACT(DOW FROM da.date_bucket) as day_of_week,  -- 0=Sunday, 6=Saturday
    EXTRACT(WEEK FROM da.date_bucket) as week_number,

    -- Moving averages for trend analysis
    AVG(da.daily_avg_temp) OVER (
      PARTITION BY da.location, da.sensor_type
      ORDER BY da.date_bucket
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as temp_7day_avg,

    AVG(da.daily_avg_temp) OVER (
      PARTITION BY da.location, da.sensor_type  
      ORDER BY da.date_bucket
      ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) as temp_30day_avg,

    -- Trend detection
    da.daily_avg_temp - LAG(da.daily_avg_temp, 7) OVER (
      PARTITION BY da.location, da.sensor_type
      ORDER BY da.date_bucket
    ) as week_over_week_temp_change,

    -- Seasonality indicators
    LAG(da.daily_avg_temp, 7) OVER (
      PARTITION BY da.location, da.sensor_type
      ORDER BY da.date_bucket
    ) as same_day_last_week_temp,

    LAG(da.daily_avg_temp, 30) OVER (
      PARTITION BY da.location, da.sensor_type  
      ORDER BY da.date_bucket
    ) as same_day_last_month_temp

  FROM daily_aggregates da
),

trend_analysis AS (
  SELECT 
    wp.location,
    wp.sensor_type,
    COUNT(*) as analysis_days,

    -- Overall trend analysis
    AVG(wp.daily_avg_temp) as overall_avg_temp,
    STDDEV(wp.daily_avg_temp) as temp_variability,
    MIN(wp.daily_min_temp) as absolute_min_temp,
    MAX(wp.daily_max_temp) as absolute_max_temp,

    -- Seasonal pattern analysis  
    AVG(CASE WHEN wp.day_of_week IN (0,6) THEN wp.daily_avg_temp END) as weekend_avg_temp,
    AVG(CASE WHEN wp.day_of_week BETWEEN 1 AND 5 THEN wp.daily_avg_temp END) as weekday_avg_temp,

    -- Weekly cyclical patterns
    AVG(CASE WHEN wp.day_of_week = 0 THEN wp.daily_avg_temp END) as sunday_avg,
    AVG(CASE WHEN wp.day_of_week = 1 THEN wp.daily_avg_temp END) as monday_avg,
    AVG(CASE WHEN wp.day_of_week = 2 THEN wp.daily_avg_temp END) as tuesday_avg,
    AVG(CASE WHEN wp.day_of_week = 3 THEN wp.daily_avg_temp END) as wednesday_avg,
    AVG(CASE WHEN wp.day_of_week = 4 THEN wp.daily_avg_temp END) as thursday_avg,
    AVG(CASE WHEN wp.day_of_week = 5 THEN wp.daily_avg_temp END) as friday_avg,
    AVG(CASE WHEN wp.day_of_week = 6 THEN wp.daily_avg_temp END) as saturday_avg,

    -- Trend strength analysis
    AVG(wp.week_over_week_temp_change) as avg_weekly_change,
    STDDEV(wp.week_over_week_temp_change) as weekly_change_variability,

    -- Linear trend approximation (simplified)
    (MAX(wp.temp_30day_avg) - MIN(wp.temp_30day_avg)) / 
    NULLIF(EXTRACT(DAY FROM MAX(wp.date_bucket) - MIN(wp.date_bucket)), 0) as daily_trend_rate,

    -- Data quality trend
    AVG(wp.daily_completeness) as avg_data_completeness,
    MIN(wp.daily_completeness) as worst_daily_completeness

  FROM weekly_patterns wp
  WHERE wp.date_bucket >= CURRENT_DATE - INTERVAL '60 days'  -- Focus on last 60 days for trends
  GROUP BY wp.location, wp.sensor_type
)

SELECT 
  ta.location,
  ta.sensor_type,
  ta.analysis_days,

  -- Environmental summary
  ROUND(ta.overall_avg_temp, 2) as avg_temperature,
  ROUND(ta.temp_variability, 2) as temperature_variability,
  ta.absolute_min_temp,
  ta.absolute_max_temp,

  -- Seasonal patterns
  ROUND(COALESCE(ta.weekday_avg_temp, 0), 2) as weekday_avg_temp,
  ROUND(COALESCE(ta.weekend_avg_temp, 0), 2) as weekend_avg_temp,
  ROUND(COALESCE(ta.weekend_avg_temp - ta.weekday_avg_temp, 0), 2) as weekend_weekday_diff,

  -- Weekly pattern analysis (day of week variations)
  JSON_OBJECT(
    'sunday', ROUND(COALESCE(ta.sunday_avg, 0), 2),
    'monday', ROUND(COALESCE(ta.monday_avg, 0), 2),
    'tuesday', ROUND(COALESCE(ta.tuesday_avg, 0), 2),
    'wednesday', ROUND(COALESCE(ta.wednesday_avg, 0), 2),
    'thursday', ROUND(COALESCE(ta.thursday_avg, 0), 2),
    'friday', ROUND(COALESCE(ta.friday_avg, 0), 2),
    'saturday', ROUND(COALESCE(ta.saturday_avg, 0), 2)
  ) as daily_temperature_pattern,

  -- Trend analysis
  ROUND(ta.avg_weekly_change, 3) as avg_weekly_temperature_change,
  ROUND(ta.daily_trend_rate * 30, 3) as monthly_trend_rate,

  -- Trend classification
  CASE 
    WHEN ta.daily_trend_rate > 0.1 THEN 'warming_trend'
    WHEN ta.daily_trend_rate < -0.1 THEN 'cooling_trend'
    ELSE 'stable'
  END as temperature_trend_classification,

  -- Seasonal pattern classification
  CASE 
    WHEN ABS(COALESCE(ta.weekend_avg_temp - ta.weekday_avg_temp, 0)) > 2 THEN 'strong_weekly_pattern'
    WHEN ABS(COALESCE(ta.weekend_avg_temp - ta.weekday_avg_temp, 0)) > 1 THEN 'moderate_weekly_pattern'
    ELSE 'minimal_weekly_pattern'
  END as weekly_seasonality,

  -- Variability assessment
  CASE 
    WHEN ta.temp_variability > 5 THEN 'highly_variable'
    WHEN ta.temp_variability > 2 THEN 'moderately_variable'
    ELSE 'stable_environment'
  END as environment_stability,

  -- Data quality assessment
  ROUND(ta.avg_data_completeness, 1) as avg_data_completeness_percent,

  -- Insights and recommendations
  CASE 
    WHEN ABS(ta.daily_trend_rate) > 0.1 THEN 'Monitor for environmental changes'
    WHEN ta.temp_variability > 5 THEN 'High variability - check for external factors'
    WHEN ta.avg_data_completeness < 90 THEN 'Improve sensor reliability'
    ELSE 'Environment stable, monitoring nominal'
  END as analysis_recommendation

FROM trend_analysis ta
WHERE ta.analysis_days >= 30  -- Require at least 30 days for meaningful trend analysis
ORDER BY 
  ABS(ta.daily_trend_rate) DESC,  -- Show locations with strongest trends first
  ta.temp_variability DESC,
  ta.location, 
  ta.sensor_type;

-- Time series data retention and archival with automated lifecycle management
WITH retention_analysis AS (
  SELECT 
    -- Analyze data age distribution
    DATE_TRUNC('day', timestamp) as date_bucket,
    metadata.location,
    COUNT(*) as daily_record_count,
    AVG(measurements.temperature) as daily_avg_temp,

    -- Data age categories (derived from the daily bucket so the expression is valid under the GROUP BY)
    CASE 
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_TIMESTAMP - INTERVAL '7 days' THEN 'hot_data'
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_TIMESTAMP - INTERVAL '30 days' THEN 'warm_data' 
      WHEN DATE_TRUNC('day', timestamp) >= CURRENT_TIMESTAMP - INTERVAL '90 days' THEN 'cold_data'
      ELSE 'archive_candidate'
    END as data_tier,

    -- Storage impact estimation
    COUNT(*) * 500 as estimated_storage_bytes,  -- Assume ~500 bytes per document

    -- Access pattern analysis (simplified)
    CURRENT_DATE - DATE_TRUNC('day', timestamp)::DATE as days_old

  FROM sensor_data
  WHERE timestamp >= CURRENT_DATE - INTERVAL '180 days'  -- Analyze last 6 months
  GROUP BY DATE_TRUNC('day', timestamp), metadata.location
),

archival_candidates AS (
  SELECT 
    ra.location,
    ra.data_tier,
    COUNT(*) as total_days,
    SUM(ra.daily_record_count) as total_records,
    SUM(ra.estimated_storage_bytes) as total_estimated_bytes,
    MIN(ra.days_old) as newest_data_age_days,
    MAX(ra.days_old) as oldest_data_age_days,
    AVG(ra.daily_avg_temp) as avg_temperature_for_tier

  FROM retention_analysis ra
  GROUP BY ra.location, ra.data_tier
),

archival_recommendations AS (
  SELECT 
    ac.location,
    ac.data_tier,
    ac.total_records,
    ROUND(ac.total_estimated_bytes / 1024.0 / 1024.0, 2) as estimated_storage_mb,
    ac.oldest_data_age_days,

    -- Archival recommendations
    CASE ac.data_tier
      WHEN 'archive_candidate' THEN 'ARCHIVE: Move to cold storage or delete'
      WHEN 'cold_data' THEN 'CONSIDER: Compress or move to slower storage'
      WHEN 'warm_data' THEN 'OPTIMIZE: Apply compression if not already done'
      ELSE 'KEEP: Hot data for active queries'
    END as retention_recommendation,

    -- Priority scoring for archival actions
    CASE 
      WHEN ac.data_tier = 'archive_candidate' AND ac.total_estimated_bytes > 100*1024*1024 THEN 'high_priority'
      WHEN ac.data_tier = 'cold_data' AND ac.total_estimated_bytes > 50*1024*1024 THEN 'medium_priority'
      WHEN ac.data_tier IN ('archive_candidate', 'cold_data') THEN 'low_priority'
      ELSE 'no_action_needed'
    END as archival_priority,

    -- Estimated storage savings
    CASE ac.data_tier
      WHEN 'archive_candidate' THEN ac.total_estimated_bytes * 0.9  -- 90% savings from deletion
      WHEN 'cold_data' THEN ac.total_estimated_bytes * 0.6  -- 60% savings from compression
      ELSE 0
    END as potential_storage_savings_bytes

  FROM archival_candidates ac
)

SELECT 
  ar.location,
  ar.data_tier,
  ar.total_records,
  ar.estimated_storage_mb,
  ar.oldest_data_age_days,
  ar.retention_recommendation,
  ar.archival_priority,
  ROUND(ar.potential_storage_savings_bytes / 1024.0 / 1024.0, 2) as potential_savings_mb,

  -- Specific actions
  CASE ar.data_tier
    WHEN 'archive_candidate' THEN 
      FORMAT('DELETE FROM sensor_data WHERE timestamp < CURRENT_DATE - INTERVAL ''90 days'' AND metadata.location = ''%s''', 
             ar.location)
    WHEN 'cold_data' THEN
      FORMAT('Consider enabling compression for location: %s', ar.location)
    ELSE 'No action required'
  END as suggested_action

FROM archival_recommendations ar
WHERE ar.archival_priority != 'no_action_needed'
ORDER BY 
  CASE ar.archival_priority
    WHEN 'high_priority' THEN 1
    WHEN 'medium_priority' THEN 2  
    WHEN 'low_priority' THEN 3
    ELSE 4
  END,
  ar.estimated_storage_mb DESC;

-- QueryLeaf provides comprehensive MongoDB time series capabilities:
-- 1. Purpose-built time series collections with automatic optimization
-- 2. Advanced temporal aggregation with statistical analysis
-- 3. Intelligent bucketing and compression for storage efficiency
-- 4. Built-in retention policies and lifecycle management
-- 5. Real-time analytics and anomaly detection
-- 6. Comprehensive trend analysis and seasonality detection
-- 7. SQL-familiar syntax for complex time series operations
-- 8. Automatic indexing and query optimization
-- 9. Production-ready time series analytics and reporting
-- 10. Integration with MongoDB's native time series optimizations

Best Practices for Production Time Series Applications

Storage Optimization and Performance Strategy

Essential principles for effective MongoDB time series deployments; a minimal collection-setup sketch follows the list:

  1. Collection Design: Configure appropriate time series granularity and bucketing strategies based on data ingestion patterns
  2. Index Strategy: Create compound indexes optimizing for common query patterns combining time ranges with metadata filters
  3. Compression Management: Enable appropriate compression algorithms to optimize storage efficiency for temporal data
  4. Retention Policies: Implement automatic data expiration and archival strategies aligned with business requirements
  5. Aggregation Optimization: Design aggregation pipelines that leverage time series collection optimizations
  6. Monitoring Integration: Track collection performance, storage utilization, and query patterns for continuous optimization
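
The first four principles above map onto collection-level options rather than application code. The sketch below illustrates them with pymongo; the database name iot, the minutes granularity, the 90-day expiry, and the connection string are assumptions for this example rather than values taken from the article, and the same options are available from mongosh or any other driver.

# Minimal pymongo sketch: create a time series collection with retention and a
# compound index. Names, granularity, and the expiry window are illustrative assumptions.
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["iot"]                                 # assumed database name

# 1. Collection design: timeField/metaField drive bucketing; granularity should
#    roughly match the interval between readings from a single device.
db.create_collection(
    "sensor_data",
    timeseries={
        "timeField": "timestamp",   # BSON date present on every document
        "metaField": "metadata",    # device_id, sensor_type, location, ...
        "granularity": "minutes"    # "seconds" | "minutes" | "hours"
    },
    expireAfterSeconds=90 * 24 * 3600  # 4. Retention: expire data older than ~90 days
)

# 2. Index strategy: compound index for the common "one device over a time range" query shape
db["sensor_data"].create_index(
    [("metadata.device_id", ASCENDING), ("timestamp", DESCENDING)]
)

# 4. Retention policies can be adjusted later without recreating the collection
db.command("collMod", "sensor_data", expireAfterSeconds=30 * 24 * 3600)

Both the granularity and the expiration window can be changed later with collMod, so it is reasonable to start with conservative settings and tune them once real ingestion patterns are visible.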

Scalability and Production Deployment

Optimize time series operations for enterprise-scale requirements; a sharding sketch follows the list:

  1. Sharding Strategy: Design shard keys that support time-based distribution and query patterns
  2. Data Lifecycle Management: Implement tiered storage strategies for hot, warm, and cold time series data
  3. Real-Time Processing: Configure streaming aggregation and real-time analytics for time-sensitive applications
  4. Capacity Planning: Monitor ingestion rates, storage growth, and query performance for scaling decisions
  5. Disaster Recovery: Design backup and recovery strategies appropriate for time series data characteristics
  6. Integration Patterns: Implement integration with monitoring, alerting, and visualization platforms
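
For the sharding strategy in particular, a time series collection is typically distributed on its metadata fields, optionally followed by the time field. The sketch below is a minimal pymongo illustration that assumes a sharded cluster reachable through a mongos host and reuses the hypothetical iot.sensor_data namespace from the earlier sketch; it is not taken from this article's examples.

# Minimal sharding sketch for a time series collection (assumes a sharded cluster
# and the iot.sensor_data collection created above; names are illustrative).
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # connect through mongos (assumed host)

# Enable sharding for the database (required on older versions; implicit on newer ones)
client.admin.command("enableSharding", "iot")

# Shard on the device identifier plus time so that:
#   - data for one device over a time range stays on as few shards as possible
#   - the time field, which must come last in the key for time series collections,
#     spreads each device's growing history across chunks
client.admin.command(
    "shardCollection",
    "iot.sensor_data",
    key={"metadata.device_id": 1, "timestamp": 1},
)

Hashed sharding on a metadata subfield alone is also an option when queries are not device-scoped; when the time field is included in the key, it must be the last component and is range-sharded.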

Conclusion

MongoDB time series collections provide purpose-built temporal data management, combining efficient storage, high-performance analytics, and scalable ingestion for IoT, monitoring, and analytical applications through optimized bucketing, automatic compression, and specialized indexing. Because this support is native, temporal workloads benefit from MongoDB's storage efficiency, query optimization, and analytical capabilities without application-level workarounds.

Key MongoDB Time Series benefits include:

  • Storage Optimization: Automatic compression and bucketing strategies optimized for temporal data patterns
  • Query Performance: Specialized indexing and aggregation capabilities for time-range and analytical queries
  • Ingestion Efficiency: High-throughput data insertion with minimal overhead and optimal storage utilization
  • Analytical Capabilities: Built-in aggregation functions designed for time series analytics and trend analysis
  • Lifecycle Management: Automatic retention policies and data expiration for operational efficiency
  • SQL Accessibility: Familiar SQL-style time series operations through QueryLeaf for accessible temporal data management

Whether you're building IoT platforms, system monitoring solutions, financial analytics applications, or sensor data management systems, MongoDB time series collections with QueryLeaf's familiar SQL interface provide the foundation for efficient, scalable temporal data management.

QueryLeaf Integration: QueryLeaf automatically optimizes MongoDB time series operations while providing SQL-familiar syntax for temporal data management, aggregation, and analytics. Advanced time series patterns, compression strategies, and analytical functions are seamlessly handled through familiar SQL constructs, making sophisticated time series applications accessible to SQL-oriented development teams.

The combination of MongoDB's robust time series capabilities with SQL-style temporal operations makes it an ideal platform for applications that need both high-performance time series storage and familiar database management patterns. On that foundation, temporal data operations can scale efficiently while maintaining query performance and storage optimization as data volume and analytical complexity grow.