MongoDB Atlas Vector Search for AI Applications: Building Semantic Search and Retrieval-Augmented Generation Systems with SQL-Style Operations
Modern AI applications require sophisticated data retrieval capabilities that go beyond traditional text matching to understand semantic meaning, context, and conceptual similarity. Vector search technology enables applications to find relevant information based on meaning rather than exact keyword matches, powering everything from recommendation engines to retrieval-augmented generation (RAG) systems.
MongoDB Atlas Vector Search provides native vector database capabilities integrated directly into MongoDB's document model, enabling developers to build AI applications without managing separate vector databases. Unlike standalone vector databases that require complex data synchronization and additional infrastructure, Atlas Vector Search combines traditional document operations with vector similarity search in a single, scalable platform.
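Before contrasting the two approaches, here is a minimal sketch of what a native vector query looks like with the Node.js driver. The articles collection, the vector_index index name, and the MONGODB_URI environment variable are illustrative assumptions, not part of any existing application:
// Minimal Atlas $vectorSearch sketch (assumes an 'articles' collection with an
// 'embedding' field and an Atlas Vector Search index named 'vector_index')
const { MongoClient } = require('mongodb');
async function findSimilarArticles(queryVector) {
  const client = new MongoClient(process.env.MONGODB_URI);
  try {
    const articles = client.db('demo').collection('articles');
    return await articles.aggregate([
      {
        $vectorSearch: {
          index: 'vector_index',   // Atlas Vector Search index name
          path: 'embedding',       // field storing the document embedding
          queryVector,             // embedding vector for the search query
          numCandidates: 100,      // candidates considered before final ranking
          limit: 10                // results returned
        }
      },
      { $project: { title: 1, score: { $meta: 'vectorSearchScore' } } }
    ]).toArray();
  } finally {
    await client.close();
  }
}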
The Traditional Vector Search Infrastructure Challenge
Building AI applications with traditional vector databases often requires complex, fragmented infrastructure:
-- Traditional PostgreSQL with pgvector extension - complex setup and limited scalability
-- Enable vector extension (requires superuser privileges)
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table for document storage with vector embeddings
CREATE TABLE document_embeddings (
document_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
title TEXT NOT NULL,
content TEXT NOT NULL,
source_url TEXT,
document_type VARCHAR(50),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Vector embedding column (pgvector stores up to 16,000 dimensions, but its indexes support at most 2,000)
embedding vector(1536), -- OpenAI embedding dimension
-- Metadata for filtering
category VARCHAR(100),
language VARCHAR(10) DEFAULT 'en',
author VARCHAR(200),
tags TEXT[],
-- Full-text search support
search_vector tsvector GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(content, '')), 'B')
) STORED
);
-- Vector similarity index (limited indexing options)
CREATE INDEX embedding_idx ON document_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000); -- Requires manual tuning
-- Full-text search index
CREATE INDEX document_search_idx ON document_embeddings USING GIN(search_vector);
-- Compound index for metadata filtering
CREATE INDEX document_metadata_idx ON document_embeddings(category, language, created_at);
-- Complex vector similarity search with metadata filtering
WITH vector_search AS (
SELECT
document_id,
title,
content,
category,
author,
created_at,
-- Cosine similarity calculation
1 - (embedding <=> $1::vector) as similarity_score,
-- L2 distance (alternative metric)
embedding <-> $1::vector as l2_distance,
-- Inner product similarity
(embedding <#> $1::vector) * -1 as inner_product_similarity,
-- Hybrid scoring combining vector and text search
ts_rank(search_vector, plainto_tsquery('english', $2)) as text_relevance_score
FROM document_embeddings
WHERE
-- Metadata filtering (applied before vector search for performance)
category = ANY($3::text[])
AND language = $4
AND created_at >= $5::timestamp
-- Optional full-text pre-filtering
AND (CASE WHEN $2 IS NOT NULL AND $2 != ''
THEN search_vector @@ plainto_tsquery('english', $2)
ELSE true END)
),
ranked_results AS (
SELECT *,
-- Hybrid ranking combining multiple signals
(0.7 * similarity_score + 0.3 * text_relevance_score) as hybrid_score,
-- Relevance classification
CASE
WHEN similarity_score >= 0.8 THEN 'highly_relevant'
WHEN similarity_score >= 0.6 THEN 'relevant'
WHEN similarity_score >= 0.4 THEN 'somewhat_relevant'
ELSE 'low_relevance'
END as relevance_category,
-- Diversity scoring (for result diversification)
ROW_NUMBER() OVER (PARTITION BY category ORDER BY similarity_score DESC) as category_rank
FROM vector_search
WHERE similarity_score >= 0.3 -- Similarity threshold
),
diversified_results AS (
SELECT *,
-- Result diversification logic
CASE
WHEN category_rank <= 2 THEN hybrid_score -- Top 2 per category get full score
WHEN category_rank <= 5 THEN hybrid_score * 0.8 -- Next 3 get reduced score
ELSE hybrid_score * 0.5 -- Others get significantly reduced score
END as diversified_score
FROM ranked_results
)
SELECT
document_id,
title,
LEFT(content, 500) as content_preview, -- Truncate for performance
category,
author,
created_at,
ROUND(similarity_score::numeric, 4) as similarity,
ROUND(text_relevance_score::numeric, 4) as text_relevance,
ROUND(diversified_score::numeric, 4) as final_score,
relevance_category,
-- Highlight matching terms (requires additional processing)
ts_headline('english', content, plainto_tsquery('english', $2),
'MaxWords=50, MinWords=20, MaxFragments=3') as highlighted_content
FROM diversified_results
ORDER BY diversified_score DESC, similarity_score DESC
LIMIT $6::int -- Result limit parameter
OFFSET $7::int; -- Pagination offset
-- Problems with traditional vector database approaches:
-- 1. Complex infrastructure requiring separate vector database setup and management
-- 2. Limited integration between vector search and traditional document operations
-- 3. Manual index tuning and maintenance for optimal vector search performance
-- 4. Difficult data synchronization between operational databases and vector stores
-- 5. Limited scalability and high operational complexity for production deployments
-- 6. Fragmented query capabilities requiring multiple systems for comprehensive search
-- 7. Complex hybrid search implementations combining vector and traditional search
-- 8. Limited support for real-time updates and dynamic vector index management
-- 9. Expensive infrastructure costs for separate specialized vector database systems
-- 10. Difficult migration paths and vendor lock-in with specialized vector database solutions
-- Pinecone example (proprietary vector database)
-- Requires separate service, API calls, and complex data synchronization
-- Limited filtering capabilities and expensive for large-scale applications
-- No native SQL interface or familiar query patterns
-- Weaviate/Chroma examples similarly require:
-- - Separate infrastructure and service management
-- - Complex data pipeline orchestration
-- - Limited integration with existing application databases
-- - Expensive scaling and operational complexity
MongoDB Atlas Vector Search provides integrated vector database capabilities:
// MongoDB Atlas Vector Search - native integration with document operations
const { MongoClient, ObjectId } = require('mongodb');
// Advanced Atlas Vector Search system for AI applications
class AtlasVectorSearchManager {
constructor(connectionString, databaseName) {
this.client = new MongoClient(connectionString);
this.db = this.client.db(databaseName);
this.collections = {
documents: this.db.collection('documents'),
embeddings: this.db.collection('embeddings'),
searchLogs: this.db.collection('search_logs'),
userProfiles: this.db.collection('user_profiles')
};
this.embeddingDimensions = 1536; // OpenAI embedding size
this.searchConfigs = new Map();
this.performanceMetrics = new Map();
}
async createVectorSearchIndexes() {
console.log('Creating optimized vector search indexes for AI applications...');
try {
// Primary vector search index for document embeddings
await this.collections.documents.createSearchIndex({
name: "document_vector_index",
type: "vectorSearch",
definition: {
"fields": [
{
"type": "vector",
"path": "embedding",
"numDimensions": this.embeddingDimensions,
"similarity": "cosine"
},
{
"type": "filter",
"path": "metadata.category"
},
{
"type": "filter",
"path": "metadata.language"
},
{
"type": "filter",
"path": "metadata.source"
},
{
"type": "filter",
"path": "created_at"
},
{
"type": "filter",
"path": "metadata.tags"
}
]
}
});
// Hybrid search index combining full-text and vector search
await this.collections.documents.createSearchIndex({
name: "hybrid_search_index",
type: "search",
definition: {
"mappings": {
"dynamic": false,
"fields": {
"title": {
"type": "text",
"analyzer": "lucene.standard"
},
"content": {
"type": "text",
"analyzer": "lucene.english"
},
"metadata": {
"type": "document",
"fields": {
"category": {
"type": "string"
},
"tags": {
"type": "stringFacet"
},
"language": {
"type": "string"
}
}
}
}
}
}
});
// User preference vector index for personalized search
await this.collections.userProfiles.createSearchIndex({
name: "user_preference_vector_index",
type: "vectorSearch",
definition: {
"fields": [
{
"type": "vector",
"path": "preference_embedding",
"numDimensions": this.embeddingDimensions,
"similarity": "cosine"
},
{
"type": "filter",
"path": "user_id"
},
{
"type": "filter",
"path": "profile_type"
}
]
}
});
console.log('Vector search indexes created successfully');
return { success: true, indexes: ['document_vector_index', 'hybrid_search_index', 'user_preference_vector_index'] };
} catch (error) {
console.error('Error creating vector search indexes:', error);
return { success: false, error: error.message };
}
}
async ingestDocumentsWithEmbeddings(documents, embeddingFunction) {
console.log(`Ingesting ${documents.length} documents with vector embeddings...`);
const batchSize = 100;
const batches = [];
let totalIngested = 0;
// Process documents in batches for optimal performance
for (let i = 0; i < documents.length; i += batchSize) {
const batch = documents.slice(i, i + batchSize);
batches.push(batch);
}
for (const [batchIndex, batch] of batches.entries()) {
console.log(`Processing batch ${batchIndex + 1}/${batches.length}`);
try {
// Generate embeddings for batch
const batchTexts = batch.map(doc => `${doc.title}\n\n${doc.content}`);
const embeddings = await embeddingFunction(batchTexts);
// Prepare documents with embeddings and metadata
const enrichedDocuments = batch.map((doc, index) => ({
_id: doc._id || new ObjectId(),
title: doc.title,
content: doc.content,
// Vector embedding
embedding: embeddings[index],
// Rich metadata for filtering and analytics
metadata: {
category: doc.category || 'general',
subcategory: doc.subcategory,
language: doc.language || 'en',
source: doc.source || 'unknown',
source_url: doc.source_url,
author: doc.author,
tags: doc.tags || [],
// Content analysis metadata
word_count: this.calculateWordCount(doc.content),
reading_time_minutes: Math.ceil(this.calculateWordCount(doc.content) / 200),
content_type: this.inferContentType(doc),
sentiment_score: doc.sentiment_score,
// Technical metadata
extraction_method: doc.extraction_method || 'manual',
processing_version: '1.0',
quality_score: this.calculateQualityScore(doc)
},
// Timestamps
created_at: doc.created_at || new Date(),
updated_at: new Date(),
indexed_at: new Date(),
// Search optimization fields
searchable_text: `${doc.title} ${doc.content} ${(doc.tags || []).join(' ')}`,
// Embedding metadata
embedding_model: 'text-embedding-ada-002',
embedding_dimensions: this.embeddingDimensions,
embedding_created_at: new Date()
}));
// Bulk insert with error handling
const result = await this.collections.documents.insertMany(enrichedDocuments, {
ordered: false,
writeConcern: { w: 'majority' }
});
totalIngested += result.insertedCount;
console.log(`Batch ${batchIndex + 1} completed: ${result.insertedCount} documents ingested`);
} catch (error) {
console.error(`Error processing batch ${batchIndex + 1}:`, error);
continue; // Continue with next batch
}
}
console.log(`Document ingestion completed: ${totalIngested}/${documents.length} documents successfully ingested`);
return {
success: true,
totalIngested,
totalDocuments: documents.length,
successRate: (totalIngested / documents.length * 100).toFixed(2)
};
}
async performSemanticSearch(queryEmbedding, options = {}) {
console.log('Performing semantic vector search...');
const {
limit = 10,
categories = [],
language = null,
source = null,
tags = [],
dateRange = null,
similarityThreshold = 0.7,
includeMetadata = true,
boostFactors = {},
userProfile = null
} = options;
// Build filter criteria
const filterCriteria = [];
if (categories.length > 0) {
filterCriteria.push({
"metadata.category": { $in: categories }
});
}
if (language) {
filterCriteria.push({
"metadata.language": { $eq: language }
});
}
if (source) {
filterCriteria.push({
"metadata.source": { $eq: source }
});
}
if (tags.length > 0) {
filterCriteria.push({
"metadata.tags": { $in: tags }
});
}
if (dateRange) {
filterCriteria.push({
"created_at": {
$gte: dateRange.start,
$lte: dateRange.end
}
});
}
try {
// Build aggregation pipeline for vector search
const pipeline = [
{
$vectorSearch: {
index: "document_vector_index",
path: "embedding",
queryVector: queryEmbedding,
numCandidates: limit * 10, // Search more candidates for better results
limit: limit * 2, // Get extra results for post-processing
...(filterCriteria.length > 0 && {
filter: {
$and: filterCriteria
}
})
}
},
// Add similarity score
{
$addFields: {
similarity_score: { $meta: "vectorSearchScore" }
}
},
// Filter by similarity threshold
{
$match: {
similarity_score: { $gte: similarityThreshold }
}
},
// Add computed fields for ranking
{
$addFields: {
// Content quality boost
quality_boost: {
$multiply: [
"$metadata.quality_score",
boostFactors.quality || 1.0
]
},
// Recency boost (decays with document age so newer content ranks higher)
recency_boost: {
  $multiply: [
    {
      $divide: [
        1,
        {
          $add: [
            1,
            {
              $divide: [
                { $subtract: [new Date(), "$created_at"] },
                86400000 * 365 // milliseconds per year
              ]
            }
          ]
        }
      ]
    },
    boostFactors.recency || 0.1
  ]
},
// Source authority boost
source_boost: {
$switch: {
branches: [
{ case: { $eq: ["$metadata.source", "official"] }, then: boostFactors.official || 1.2 },
{ case: { $eq: ["$metadata.source", "expert"] }, then: boostFactors.expert || 1.1 }
],
default: 1.0
}
}
}
},
// Calculate final ranking score
{
$addFields: {
final_score: {
$multiply: [
"$similarity_score",
{
$add: [
1.0,
"$quality_boost",
"$recency_boost",
"$source_boost"
]
}
]
},
// Relevance classification
relevance_category: {
$switch: {
branches: [
{ case: { $gte: ["$similarity_score", 0.9] }, then: "highly_relevant" },
{ case: { $gte: ["$similarity_score", 0.8] }, then: "relevant" },
{ case: { $gte: ["$similarity_score", 0.7] }, then: "somewhat_relevant" }
],
default: "marginally_relevant"
}
}
}
},
// Add personalization if user profile provided
...(userProfile ? [{
$lookup: {
from: "user_profiles",
let: { doc_category: "$metadata.category", doc_tags: "$metadata.tags" },
pipeline: [
{
$match: {
user_id: userProfile.user_id,
$expr: {
$or: [
{ $in: ["$$doc_category", "$preferred_categories"] },
{ $gt: [{ $size: { $setIntersection: ["$$doc_tags", "$preferred_tags"] } }, 0] }
]
}
}
}
],
as: "user_preference_match"
}
}, {
$addFields: {
personalization_boost: {
$cond: {
if: { $gt: [{ $size: "$user_preference_match" }, 0] },
then: boostFactors.personalization || 1.15,
else: 1.0
}
},
final_score: {
$multiply: ["$final_score", "$personalization_boost"]
}
}
}] : []),
// Sort by final score
{
$sort: { final_score: -1, similarity_score: -1 }
},
// Limit results
{
$limit: limit
},
// Project final fields
{
$project: {
_id: 1,
title: 1,
content: 1,
...(includeMetadata && { metadata: 1 }),
similarity_score: { $round: ["$similarity_score", 4] },
final_score: { $round: ["$final_score", 4] },
relevance_category: 1,
created_at: 1,
// Generate content snippet
content_snippet: {
$substr: ["$content", 0, 300]
},
// Search result metadata
search_metadata: {
embedding_model: "$embedding_model",
indexed_at: "$indexed_at",
quality_score: "$metadata.quality_score"
}
}
}
];
const startTime = Date.now();
const results = await this.collections.documents.aggregate(pipeline).toArray();
const searchTime = Date.now() - startTime;
// Log search performance
this.recordSearchMetrics({
query_type: 'semantic_vector_search',
results_count: results.length,
search_time_ms: searchTime,
similarity_threshold: similarityThreshold,
filters_applied: filterCriteria.length,
timestamp: new Date()
});
console.log(`Semantic search completed: ${results.length} results in ${searchTime}ms`);
return {
success: true,
results: results,
search_metadata: {
query_type: 'semantic',
results_count: results.length,
search_time_ms: searchTime,
similarity_threshold: similarityThreshold,
filters_applied: filterCriteria.length,
personalized: !!userProfile
}
};
} catch (error) {
console.error('Semantic search error:', error);
return {
success: false,
error: error.message,
results: []
};
}
}
async performHybridSearch(query, queryEmbedding, options = {}) {
console.log('Performing hybrid search combining text and vector similarity...');
const {
limit = 10,
textWeight = 0.3,
vectorWeight = 0.7,
categories = [],
language = 'en'
} = options;
try {
// Execute vector search
const vectorResults = await this.performSemanticSearch(queryEmbedding, {
...options,
limit: limit * 2 // Get more results for hybrid ranking
});
// Execute text search using Atlas Search
const textSearchPipeline = [
{
$search: {
index: "hybrid_search_index",
compound: {
must: [
{
text: {
query: query,
path: ["title", "content"],
fuzzy: {
maxEdits: 2,
prefixLength: 3
}
}
}
],
...(categories.length > 0 && {
filter: [
{
text: {
query: categories,
path: "metadata.category"
}
}
]
})
},
highlight: {
path: "content",
maxCharsToExamine: 1000,
maxNumPassages: 3
}
}
},
{
$addFields: {
text_score: { $meta: "searchScore" },
highlights: { $meta: "searchHighlights" }
}
},
{
$limit: limit * 2
}
];
const textResults = await this.collections.documents.aggregate(textSearchPipeline).toArray();
// Combine and rank results using hybrid scoring
const combinedResults = this.combineHybridResults(
vectorResults.results || [],
textResults,
textWeight,
vectorWeight
);
// Sort by hybrid score and limit
combinedResults.sort((a, b) => b.hybrid_score - a.hybrid_score);
const finalResults = combinedResults.slice(0, limit);
return {
success: true,
results: finalResults,
search_metadata: {
query_type: 'hybrid',
text_results_count: textResults.length,
vector_results_count: vectorResults.results?.length || 0,
combined_results_count: combinedResults.length,
final_results_count: finalResults.length,
text_weight: textWeight,
vector_weight: vectorWeight
}
};
} catch (error) {
console.error('Hybrid search error:', error);
return {
success: false,
error: error.message,
results: []
};
}
}
combineHybridResults(vectorResults, textResults, textWeight, vectorWeight) {
const resultMap = new Map();
// Normalize scores to 0-1 range
const maxVectorScore = vectorResults.length > 0 ? Math.max(...vectorResults.map(r => r.similarity_score || 0)) : 0;
const maxTextScore = textResults.length > 0 ? Math.max(...textResults.map(r => r.text_score || 0)) : 0;
// Process vector results
vectorResults.forEach(result => {
const normalizedVectorScore = maxVectorScore > 0 ? result.similarity_score / maxVectorScore : 0;
resultMap.set(result._id.toString(), {
...result,
normalized_vector_score: normalizedVectorScore,
normalized_text_score: 0,
hybrid_score: normalizedVectorScore * vectorWeight
});
});
// Process text results and combine
textResults.forEach(result => {
const normalizedTextScore = maxTextScore > 0 ? result.text_score / maxTextScore : 0;
const docId = result._id.toString();
if (resultMap.has(docId)) {
// Document found in both searches - combine scores
const existing = resultMap.get(docId);
existing.normalized_text_score = normalizedTextScore;
existing.hybrid_score = (existing.normalized_vector_score * vectorWeight) +
(normalizedTextScore * textWeight);
existing.highlights = result.highlights;
existing.search_type = 'both';
} else {
// Document only found in text search
resultMap.set(docId, {
...result,
normalized_vector_score: 0,
normalized_text_score: normalizedTextScore,
hybrid_score: normalizedTextScore * textWeight,
search_type: 'text_only',
similarity_score: 0,
relevance_category: 'text_match'
});
}
});
return Array.from(resultMap.values());
}
async buildRAGPipeline(query, options = {}) {
console.log('Building Retrieval-Augmented Generation pipeline...');
const {
contextLimit = 5,
maxContextLength = 4000,
embeddingFunction,
llmFunction,
temperature = 0.7,
includeSourceCitations = true
} = options;
try {
// Step 1: Generate query embedding
const queryEmbedding = await embeddingFunction([query]);
// Step 2: Retrieve relevant context using semantic search
const searchResults = await this.performSemanticSearch(queryEmbedding[0], {
limit: contextLimit * 2, // Get extra results for context selection
similarityThreshold: 0.6
});
if (!searchResults.success || searchResults.results.length === 0) {
return {
success: false,
error: 'No relevant context found',
query: query
};
}
// Step 3: Select and rank context documents
const contextDocuments = this.selectOptimalContext(
searchResults.results,
maxContextLength
);
// Step 4: Build context string with source tracking
const contextString = contextDocuments.map((doc, index) => {
const sourceId = `[${index + 1}]`;
return `${sourceId} ${doc.title}\n${doc.content_snippet || doc.content.substring(0, 500)}...`;
}).join('\n\n');
// Step 5: Create RAG prompt
const ragPrompt = this.buildRAGPrompt(query, contextString, includeSourceCitations);
// Step 6: Generate response using LLM
const llmResponse = await llmFunction(ragPrompt, {
temperature,
max_tokens: 1000,
stop: ["[END]"]
});
// Step 7: Extract citations and build response
const response = {
success: true,
query: query,
answer: llmResponse.text || llmResponse,
context_used: contextDocuments.length,
sources: contextDocuments.map((doc, index) => ({
id: index + 1,
title: doc.title,
similarity_score: doc.similarity_score,
source: doc.metadata?.source,
url: doc.metadata?.source_url
})),
search_metadata: searchResults.search_metadata,
generation_metadata: {
model: llmResponse.model || 'unknown',
temperature: temperature,
context_length: contextString.length,
response_tokens: llmResponse.usage?.total_tokens || 0
}
};
// Log RAG pipeline usage
await this.logRAGUsage({
query: query,
context_documents: contextDocuments.length,
response_length: response.answer.length,
sources_cited: response.sources.length,
timestamp: new Date()
});
return response;
} catch (error) {
console.error('RAG pipeline error:', error);
return {
success: false,
error: error.message,
query: query
};
}
}
selectOptimalContext(searchResults, maxLength) {
let totalLength = 0;
const selectedDocs = [];
// Sort by relevance and diversity
const rankedResults = searchResults.sort((a, b) => {
// Primary sort by similarity score
if (b.similarity_score !== a.similarity_score) {
return b.similarity_score - a.similarity_score;
}
// Secondary sort by content quality
return (b.metadata?.quality_score || 0) - (a.metadata?.quality_score || 0);
});
for (const doc of rankedResults) {
const docLength = (doc.content_snippet || doc.content || '').length;
if (totalLength + docLength <= maxLength) {
selectedDocs.push(doc);
totalLength += docLength;
}
if (selectedDocs.length >= 5) break; // Limit to top 5 documents
}
return selectedDocs;
}
buildRAGPrompt(query, context, includeCitations) {
return `You are a helpful assistant that answers questions based on the provided context. Use the context information to provide accurate and comprehensive answers.
Context Information:
${context}
Question: ${query}
Instructions:
- Answer based solely on the information provided in the context
- If the context doesn't contain enough information to answer fully, state what information is missing
- Be comprehensive but concise
${includeCitations ? '- Include source citations using the [number] format from the context' : ''}
- If no relevant information is found, clearly state that the context doesn't contain the answer
Answer:`;
}
recordSearchMetrics(metrics) {
const key = `${metrics.query_type}_${Date.now()}`;
this.performanceMetrics.set(key, metrics);
// Keep only last 1000 metrics
if (this.performanceMetrics.size > 1000) {
const oldestKey = this.performanceMetrics.keys().next().value;
this.performanceMetrics.delete(oldestKey);
}
}
async logRAGUsage(usage) {
try {
await this.collections.searchLogs.insertOne({
...usage,
type: 'rag_pipeline'
});
} catch (error) {
console.warn('Failed to log RAG usage:', error);
}
}
calculateWordCount(text) {
return (text || '').split(/\s+/).filter(word => word.length > 0).length;
}
inferContentType(doc) {
if (doc.content && doc.content.includes('```')) return 'technical';
if (doc.title && doc.title.includes('Tutorial')) return 'tutorial';
if (doc.content && doc.content.length > 2000) return 'long_form';
return 'standard';
}
calculateQualityScore(doc) {
let score = 0.5; // Base score
if (doc.title && doc.title.length > 10) score += 0.1;
if (doc.content && doc.content.length > 500) score += 0.2;
if (doc.author) score += 0.1;
if (doc.tags && doc.tags.length > 0) score += 0.1;
return Math.min(1.0, score);
}
}
// Benefits of MongoDB Atlas Vector Search:
// - Native integration with MongoDB document model and operations
// - Automatic scaling and management without separate vector database infrastructure
// - Advanced filtering capabilities combined with vector similarity search
// - Hybrid search combining full-text and vector search capabilities
// - Built-in indexing optimization for high-performance vector operations
// - Integrated analytics and monitoring for vector search performance
// - Real-time updates and dynamic index management
// - Cost-effective scaling with MongoDB Atlas infrastructure
// - Comprehensive security and compliance features
// - SQL-compatible vector operations through QueryLeaf integration
module.exports = {
AtlasVectorSearchManager
};
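The manager above is deliberately provider-agnostic: ingestion, search, and the RAG pipeline all accept an embedding function supplied by the caller. The following usage sketch shows one way to wire it up; the OpenAI client, the text-embedding-ada-002 model, the './atlas-vector-search-manager' module path, and the connection string are illustrative assumptions rather than a prescribed setup:
// Hypothetical usage sketch wiring an embedding provider into AtlasVectorSearchManager
const OpenAI = require('openai');
const { AtlasVectorSearchManager } = require('./atlas-vector-search-manager'); // assumed module path
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Map an array of texts to an array of embedding vectors
async function embed(texts) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: texts
  });
  return response.data.map(item => item.embedding);
}
async function main() {
  const manager = new AtlasVectorSearchManager(process.env.MONGODB_URI, 'knowledge_base');
  await manager.createVectorSearchIndexes();
  await manager.ingestDocumentsWithEmbeddings(
    [{ title: 'Example document', content: 'Vector search basics...', category: 'tutorial' }],
    embed
  );
  const [queryVector] = await embed(['How does semantic search work?']);
  const results = await manager.performSemanticSearch(queryVector, { limit: 5 });
  if (results.success) {
    console.log(`Found ${results.results.length} results in ${results.search_metadata.search_time_ms}ms`);
  }
}
main().catch(console.error);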
Understanding MongoDB Atlas Vector Search Architecture
Advanced Vector Search Patterns for AI Applications
Implement sophisticated vector search patterns for production AI applications:
// Advanced vector search patterns and AI application integration
class ProductionVectorSearchSystem {
constructor(atlasConfig) {
this.atlasManager = new AtlasVectorSearchManager(
atlasConfig.connectionString,
atlasConfig.database
);
this.embeddingCache = new Map();
this.searchCache = new Map();
this.analyticsCollector = new Map();
}
async buildIntelligentDocumentProcessor(documents, processingOptions = {}) {
console.log('Building intelligent document processing pipeline...');
const {
chunkSize = 1000,
chunkOverlap = 200,
embeddingModel = 'text-embedding-ada-002',
enableSemanticChunking = true,
extractKeywords = true,
analyzeSentiment = true
} = processingOptions;
const processedDocuments = [];
for (const doc of documents) {
try {
// Step 1: Intelligent document chunking
const chunks = enableSemanticChunking ?
await this.performSemanticChunking(doc.content, chunkSize, chunkOverlap) :
this.performFixedChunking(doc.content, chunkSize, chunkOverlap);
// Step 2: Process each chunk
for (const [chunkIndex, chunk] of chunks.entries()) {
const chunkDoc = {
_id: new ObjectId(),
parent_document_id: doc._id,
title: `${doc.title} - Part ${chunkIndex + 1}`,
content: chunk.text,
chunk_index: chunkIndex,
// Chunk metadata
chunk_metadata: {
word_count: chunk.word_count,
sentence_count: chunk.sentence_count,
start_position: chunk.start_position,
end_position: chunk.end_position,
semantic_density: chunk.semantic_density || 0
},
// Enhanced metadata processing
metadata: {
...doc.metadata,
// Keyword extraction
...(extractKeywords && {
keywords: await this.extractKeywords(chunk.text),
entities: await this.extractEntities(chunk.text)
}),
// Sentiment analysis
...(analyzeSentiment && {
sentiment: await this.analyzeSentiment(chunk.text)
}),
// Document structure analysis
structure_type: this.analyzeDocumentStructure(chunk.text),
information_density: this.calculateInformationDensity(chunk.text)
},
created_at: doc.created_at,
updated_at: new Date(),
processing_version: '2.0'
};
processedDocuments.push(chunkDoc);
}
} catch (error) {
console.error(`Error processing document ${doc._id}:`, error);
continue;
}
}
console.log(`Document processing completed: ${processedDocuments.length} chunks created from ${documents.length} documents`);
return processedDocuments;
}
async performSemanticChunking(text, targetSize, overlap) {
// Implement semantic-aware chunking that preserves meaning
const sentences = this.splitIntoSentences(text);
const chunks = [];
let currentChunk = '';
let currentWordCount = 0;
let startPosition = 0;
for (const sentence of sentences) {
const sentenceWordCount = sentence.split(/\s+/).length;
if (currentWordCount + sentenceWordCount > targetSize && currentChunk.length > 0) {
// Create chunk with semantic coherence
chunks.push({
text: currentChunk.trim(),
word_count: currentWordCount,
sentence_count: currentChunk.split(/[.!?]+/).length - 1,
start_position: startPosition,
end_position: startPosition + currentChunk.length,
semantic_density: await this.calculateSemanticDensity(currentChunk)
});
// Start new chunk with overlap
const overlapText = this.extractOverlapText(currentChunk, overlap);
const previousChunkLength = currentChunk.length;
currentChunk = overlapText + ' ' + sentence;
currentWordCount = this.countWords(currentChunk);
// Advance by the previous chunk length minus the overlapped portion carried forward
startPosition += previousChunkLength - overlapText.length;
} else {
currentChunk += (currentChunk ? ' ' : '') + sentence;
currentWordCount += sentenceWordCount;
}
}
// Add final chunk
if (currentChunk.trim().length > 0) {
chunks.push({
text: currentChunk.trim(),
word_count: currentWordCount,
sentence_count: currentChunk.split(/[.!?]+/).length - 1,
start_position: startPosition,
end_position: startPosition + currentChunk.length,
semantic_density: await this.calculateSemanticDensity(currentChunk)
});
}
return chunks;
}
async buildConversationalRAG(conversationHistory, currentQuery, options = {}) {
console.log('Building conversational RAG system...');
const {
contextWindow = 5,
includeConversationContext = true,
personalizeResponse = true,
userId = null
} = options;
try {
// Step 1: Build conversational context
let enhancedQuery = currentQuery;
if (includeConversationContext && conversationHistory.length > 0) {
const recentContext = conversationHistory.slice(-contextWindow);
const contextSummary = recentContext.map(turn =>
`${turn.role}: ${turn.content}`
).join('\n');
enhancedQuery = `Previous conversation context:\n${contextSummary}\n\nCurrent question: ${currentQuery}`;
}
// Step 2: Generate enhanced query embedding
const queryEmbedding = await this.generateEmbedding(enhancedQuery);
// Step 3: Personalized retrieval if user profile available
let userProfile = null;
if (personalizeResponse && userId) {
userProfile = await this.getUserProfile(userId);
}
// Step 4: Perform contextual search
const searchResults = await this.atlasManager.performSemanticSearch(queryEmbedding, {
limit: 8,
userProfile: userProfile,
boostFactors: {
recency: 0.2,
quality: 0.3,
personalization: 0.2
}
});
// Step 5: Build conversational RAG response
const ragResponse = await this.atlasManager.buildRAGPipeline(enhancedQuery, {
contextLimit: 6,
maxContextLength: 5000,
embeddingFunction: (texts) => Promise.resolve([queryEmbedding]),
llmFunction: this.createConversationalLLMFunction(conversationHistory),
includeSourceCitations: true
});
// Step 6: Post-process for conversation continuity
if (ragResponse.success) {
ragResponse.conversation_metadata = {
context_turns_used: Math.min(contextWindow, conversationHistory.length),
personalized: !!userProfile,
query_enhanced: includeConversationContext,
user_id: userId
};
}
return ragResponse;
} catch (error) {
console.error('Conversational RAG error:', error);
return {
success: false,
error: error.message,
query: currentQuery
};
}
}
createConversationalLLMFunction(conversationHistory) {
return async (prompt, options = {}) => {
// Add conversation-aware instructions
const conversationalPrompt = `You are a helpful assistant engaged in an ongoing conversation.
Previous conversation context has been provided. Use this context to:
- Maintain conversation continuity
- Reference previous topics when relevant
- Provide contextually appropriate responses
- Acknowledge when building on previous answers
${prompt}
Remember to be conversational and reference the ongoing dialogue when appropriate.`;
// This would integrate with your preferred LLM service
return await this.callLLMService(conversationalPrompt, options);
};
}
async implementRecommendationSystem(userId, options = {}) {
console.log(`Building recommendation system for user ${userId}...`);
const {
recommendationType = 'content',
diversityFactor = 0.3,
noveltyBoost = 0.2,
limit = 10
} = options;
try {
// Step 1: Get user profile and interaction history
const userProfile = await this.getUserProfile(userId);
const interactionHistory = await this.getUserInteractions(userId);
// Step 2: Build user preference embedding
const userPreferenceEmbedding = await this.buildUserPreferenceEmbedding(
userProfile,
interactionHistory
);
// Step 3: Find similar content
const candidateResults = await this.atlasManager.performSemanticSearch(
userPreferenceEmbedding,
{
limit: limit * 3, // Get more candidates for diversity
similarityThreshold: 0.4
}
);
// Step 4: Apply diversity and novelty filtering
const diversifiedResults = this.applyDiversityFiltering(
candidateResults.results,
interactionHistory,
diversityFactor,
noveltyBoost
);
// Step 5: Rank final recommendations
const finalRecommendations = diversifiedResults.slice(0, limit).map((rec, index) => ({
...rec,
recommendation_rank: index + 1,
recommendation_score: rec.final_score,
recommendation_reasons: this.generateRecommendationReasons(rec, userProfile)
}));
return {
success: true,
user_id: userId,
recommendations: finalRecommendations,
recommendation_metadata: {
algorithm: 'vector_similarity_with_diversity',
diversity_factor: diversityFactor,
novelty_boost: noveltyBoost,
candidates_evaluated: candidateResults.results?.length || 0,
final_count: finalRecommendations.length
}
};
} catch (error) {
console.error('Recommendation system error:', error);
return {
success: false,
error: error.message,
user_id: userId
};
}
}
applyDiversityFiltering(candidates, userHistory, diversityFactor, noveltyBoost) {
// Track categories and topics to ensure diversity
const categoryCount = new Map();
const diversifiedResults = [];
// Get user's previously interacted content for novelty scoring
const previouslyViewed = new Set(
userHistory.map(interaction => interaction.document_id?.toString())
);
for (const candidate of candidates) {
const category = candidate.metadata?.category || 'unknown';
const currentCategoryCount = categoryCount.get(category) || 0;
// Calculate diversity penalty (more items in category = higher penalty)
const diversityPenalty = currentCategoryCount * diversityFactor;
// Calculate novelty boost (unseen content gets boost)
const noveltyScore = previouslyViewed.has(candidate._id.toString()) ? 0 : noveltyBoost;
// Apply adjustments to final score
candidate.final_score = (candidate.final_score || candidate.similarity_score) - diversityPenalty + noveltyScore;
candidate.diversity_penalty = diversityPenalty;
candidate.novelty_boost = noveltyScore;
diversifiedResults.push(candidate);
categoryCount.set(category, currentCategoryCount + 1);
}
return diversifiedResults.sort((a, b) => b.final_score - a.final_score);
}
generateRecommendationReasons(recommendation, userProfile) {
const reasons = [];
if (userProfile.preferred_categories?.includes(recommendation.metadata?.category)) {
reasons.push(`Matches your interest in ${recommendation.metadata.category}`);
}
if (recommendation.similarity_score > 0.8) {
reasons.push('Highly relevant to your preferences');
}
if (recommendation.novelty_boost > 0) {
reasons.push('New content you haven\'t seen');
}
if (recommendation.metadata?.quality_score > 0.8) {
reasons.push('High-quality content');
}
return reasons.length > 0 ? reasons : ['Recommended based on your profile'];
}
// Utility methods
splitIntoSentences(text) {
return text.split(/[.!?]+/).filter(s => s.trim().length > 0);
}
extractOverlapText(text, overlapSize) {
const words = text.split(/\s+/);
return words.slice(-overlapSize).join(' ');
}
countWords(text) {
return text.split(/\s+/).filter(word => word.length > 0).length;
}
async calculateSemanticDensity(text) {
// Simplified semantic density calculation
const sentences = this.splitIntoSentences(text);
const avgSentenceLength = text.length / sentences.length;
const wordCount = this.countWords(text);
// Higher density = more information per word
return Math.min(1.0, (avgSentenceLength / 100) * (wordCount / 500));
}
analyzeDocumentStructure(text) {
if (text.includes('```') || text.includes('function') || text.includes('class')) return 'code';
if (text.match(/^\d+\./m) || text.includes('Step')) return 'procedural';
if (text.includes('?') && text.split('?').length > 2) return 'faq';
return 'narrative';
}
calculateInformationDensity(text) {
const uniqueWords = new Set(text.toLowerCase().match(/\b\w+\b/g) || []);
const totalWords = this.countWords(text);
return totalWords > 0 ? uniqueWords.size / totalWords : 0;
}
}
SQL-Style Vector Search Operations with QueryLeaf
QueryLeaf provides familiar SQL syntax for MongoDB Atlas Vector Search operations:
-- QueryLeaf vector search operations with SQL-familiar syntax
-- Create vector search enabled collection
CREATE COLLECTION documents_with_vectors (
_id OBJECTID PRIMARY KEY,
title VARCHAR(500) NOT NULL,
content TEXT NOT NULL,
-- Vector embedding field
embedding VECTOR(1536) NOT NULL, -- OpenAI embedding dimensions
-- Metadata for filtering
category VARCHAR(100),
language VARCHAR(10) DEFAULT 'en',
source VARCHAR(100),
tags VARCHAR[] DEFAULT ARRAY[]::VARCHAR[],
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Document analysis fields
word_count INTEGER,
reading_time_minutes INTEGER,
quality_score DECIMAL(3,2) DEFAULT 0.5,
-- Full-text search support
searchable_text TEXT GENERATED ALWAYS AS (title || ' ' || content) STORED
);
-- Create Atlas Vector Search index
CREATE VECTOR INDEX document_semantic_search ON documents_with_vectors (
embedding USING cosine_similarity
WITH FILTER FIELDS (category, language, source, created_at, tags)
);
-- Create hybrid search index for text + vector
CREATE SEARCH INDEX document_hybrid_search ON documents_with_vectors (
title WITH lucene_analyzer('standard'),
content WITH lucene_analyzer('english'),
category WITH string_facet(),
tags WITH string_facet()
);
-- Semantic vector search with SQL syntax
SELECT
_id,
title,
LEFT(content, 300) as content_preview,
category,
source,
created_at,
-- Vector similarity score
VECTOR_SIMILARITY(embedding, $1::VECTOR(1536), 'cosine') as similarity_score,
-- Relevance classification
CASE
WHEN VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.9 THEN 'highly_relevant'
WHEN VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.8 THEN 'relevant'
WHEN VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.7 THEN 'somewhat_relevant'
ELSE 'marginally_relevant'
END as relevance_category,
-- Quality-adjusted ranking score
VECTOR_SIMILARITY(embedding, $1, 'cosine') * (1 + quality_score * 0.2) as final_score
FROM documents_with_vectors
WHERE
-- Vector similarity threshold
VECTOR_SIMILARITY(embedding, $1, 'cosine') >= $2::DECIMAL -- similarity threshold parameter
-- Optional metadata filtering
AND ($3::VARCHAR[] IS NULL OR category = ANY($3)) -- categories filter
AND ($4::VARCHAR IS NULL OR language = $4) -- language filter
AND ($5::VARCHAR IS NULL OR source = $5) -- source filter
AND ($6::VARCHAR[] IS NULL OR tags && $6) -- tags overlap filter
AND ($7::TIMESTAMP IS NULL OR created_at >= $7) -- date filter
ORDER BY final_score DESC, similarity_score DESC
LIMIT $8::INTEGER; -- result limit
-- Advanced hybrid search combining vector and text similarity
WITH vector_search AS (
SELECT
_id, title, content, category, source, created_at,
VECTOR_SIMILARITY(embedding, $1::VECTOR(1536), 'cosine') as vector_score
FROM documents_with_vectors
WHERE VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.6
ORDER BY vector_score DESC
LIMIT 20
),
text_search AS (
SELECT
_id, title, content, category, source, created_at,
SEARCH_SCORE() as text_score,
SEARCH_HIGHLIGHTS('content', 3) as highlighted_content
FROM documents_with_vectors
WHERE MATCH(searchable_text, $2::TEXT) -- text query parameter
WITH search_options(
fuzzy_max_edits = 2,
fuzzy_prefix_length = 3,
highlight_max_chars = 1000
)
ORDER BY text_score DESC
LIMIT 20
),
hybrid_results AS (
SELECT
COALESCE(vs._id, ts._id) as _id,
COALESCE(vs.title, ts.title) as title,
COALESCE(vs.content, ts.content) as content,
COALESCE(vs.category, ts.category) as category,
COALESCE(vs.source, ts.source) as source,
COALESCE(vs.created_at, ts.created_at) as created_at,
-- Normalize scores to 0-1 range
COALESCE(vs.vector_score, 0) / (SELECT MAX(vector_score) FROM vector_search) as normalized_vector_score,
COALESCE(ts.text_score, 0) / (SELECT MAX(text_score) FROM text_search) as normalized_text_score,
-- Hybrid scoring with configurable weights
($3::DECIMAL * COALESCE(vs.vector_score, 0) / (SELECT MAX(vector_score) FROM vector_search)) +
($4::DECIMAL * COALESCE(ts.text_score, 0) / (SELECT MAX(text_score) FROM text_search)) as hybrid_score,
ts.highlighted_content,
-- Search type classification
CASE
WHEN vs._id IS NOT NULL AND ts._id IS NOT NULL THEN 'both'
WHEN vs._id IS NOT NULL THEN 'vector_only'
ELSE 'text_only'
END as search_type
FROM vector_search vs
FULL OUTER JOIN text_search ts ON vs._id = ts._id
)
SELECT
_id,
title,
LEFT(content, 400) as content_preview,
category,
source,
created_at,
-- Scores
ROUND(normalized_vector_score::NUMERIC, 4) as vector_similarity,
ROUND(normalized_text_score::NUMERIC, 4) as text_relevance,
ROUND(hybrid_score::NUMERIC, 4) as final_score,
search_type,
highlighted_content,
-- Content insights
CASE
WHEN hybrid_score >= 0.8 THEN 'excellent_match'
WHEN hybrid_score >= 0.6 THEN 'good_match'
WHEN hybrid_score >= 0.4 THEN 'fair_match'
ELSE 'weak_match'
END as match_quality
FROM hybrid_results
ORDER BY hybrid_score DESC, normalized_vector_score DESC
LIMIT $5::INTEGER; -- final result limit
-- Retrieval-Augmented Generation (RAG) pipeline with QueryLeaf
WITH context_retrieval AS (
SELECT
_id,
title,
content,
category,
VECTOR_SIMILARITY(embedding, $1::VECTOR(1536), 'cosine') as relevance_score
FROM documents_with_vectors
WHERE VECTOR_SIMILARITY(embedding, $1, 'cosine') >= 0.7
ORDER BY relevance_score DESC
LIMIT 5
),
context_preparation AS (
SELECT
STRING_AGG(
'[' || ROW_NUMBER() OVER (ORDER BY relevance_score DESC) || '] ' ||
title || E'\n' || LEFT(content, 500) || '...',
E'\n\n'
ORDER BY relevance_score DESC
) as context_string,
COUNT(*) as context_documents,
AVG(relevance_score) as avg_relevance,
JSON_AGG(
JSON_BUILD_OBJECT(
'id', ROW_NUMBER() OVER (ORDER BY relevance_score DESC),
'title', title,
'category', category,
'relevance', ROUND(relevance_score::NUMERIC, 4)
) ORDER BY relevance_score DESC
) as source_citations
FROM context_retrieval
)
SELECT
context_string,
context_documents,
ROUND(avg_relevance::NUMERIC, 4) as average_context_relevance,
source_citations,
-- RAG prompt construction
'You are a helpful assistant that answers questions based on provided context. ' ||
'Use the following context information to provide accurate answers.' || E'\n\n' ||
'Context Information:' || E'\n' || context_string || E'\n\n' ||
'Question: ' || $2::TEXT || E'\n\n' ||
'Instructions:' || E'\n' ||
'- Answer based solely on the provided context' || E'\n' ||
'- Include source citations using [number] format' || E'\n' ||
'- If context is insufficient, clearly state what information is missing' || E'\n\n' ||
'Answer:' as rag_prompt,
-- Query metadata
$2::TEXT as original_query,
CURRENT_TIMESTAMP as generated_at
FROM context_preparation;
-- User preference-based semantic search and recommendations
WITH user_profile AS (
SELECT
user_id,
preference_embedding,
preferred_categories,
preferred_languages,
interaction_history,
last_active
FROM user_profiles
WHERE user_id = $1::UUID
),
personalized_search AS (
SELECT
d._id,
d.title,
d.content,
d.category,
d.source,
d.created_at,
d.quality_score,
-- Semantic similarity to user preferences
VECTOR_SIMILARITY(d.embedding, up.preference_embedding, 'cosine') as preference_similarity,
-- Category preference boost
CASE
WHEN d.category = ANY(up.preferred_categories) THEN 1.2
ELSE 1.0
END as category_boost,
-- Novelty boost (content user hasn't seen)
CASE
WHEN d._id = ANY(up.interaction_history) THEN 0.8 -- Reduce score for seen content
ELSE 1.1 -- Boost novel content
END as novelty_boost,
-- Recency factor
CASE
WHEN d.created_at >= CURRENT_DATE - INTERVAL '7 days' THEN 1.1
WHEN d.created_at >= CURRENT_DATE - INTERVAL '30 days' THEN 1.05
ELSE 1.0
END as recency_boost
FROM documents_with_vectors d
CROSS JOIN user_profile up
WHERE VECTOR_SIMILARITY(d.embedding, up.preference_embedding, 'cosine') >= 0.5
AND (up.preferred_languages IS NULL OR d.language = ANY(up.preferred_languages))
),
ranked_recommendations AS (
SELECT *,
-- Calculate final personalized score
preference_similarity * category_boost * novelty_boost * recency_boost * (1 + quality_score * 0.3) as personalized_score,
-- Diversity scoring to avoid over-concentration in single category
ROW_NUMBER() OVER (PARTITION BY category ORDER BY preference_similarity DESC) as category_rank
FROM personalized_search
),
diversified_recommendations AS (
SELECT *,
-- Apply diversity penalty for category concentration
CASE
WHEN category_rank <= 2 THEN personalized_score
WHEN category_rank <= 4 THEN personalized_score * 0.9
ELSE personalized_score * 0.7
END as final_recommendation_score
FROM ranked_recommendations
)
SELECT
_id,
title,
LEFT(content, 300) as content_preview,
category,
source,
created_at,
-- Recommendation scores
ROUND(preference_similarity::NUMERIC, 4) as user_preference_match,
ROUND(personalized_score::NUMERIC, 4) as personalized_relevance,
ROUND(final_recommendation_score::NUMERIC, 4) as recommendation_score,
-- Recommendation explanations
CASE
WHEN category_boost > 1.0 AND novelty_boost > 1.0 THEN 'New content in your preferred categories'
WHEN category_boost > 1.0 THEN 'Matches your category preferences'
WHEN novelty_boost > 1.0 THEN 'New content you might find interesting'
WHEN recency_boost > 1.0 THEN 'Recently published content'
ELSE 'Recommended based on your preferences'
END as recommendation_reason,
-- Quality indicators
CASE
WHEN quality_score >= 0.8 AND preference_similarity >= 0.8 THEN 'high_confidence'
WHEN quality_score >= 0.6 AND preference_similarity >= 0.6 THEN 'medium_confidence'
ELSE 'exploratory'
END as confidence_level
FROM diversified_recommendations
ORDER BY final_recommendation_score DESC, preference_similarity DESC
LIMIT $2::INTEGER; -- recommendation count limit
-- Real-time vector search analytics and performance monitoring
CREATE MATERIALIZED VIEW vector_search_analytics AS
WITH search_performance AS (
SELECT
DATE_TRUNC('hour', search_timestamp) as hour_bucket,
search_type, -- 'vector', 'text', 'hybrid'
-- Performance metrics
COUNT(*) as search_count,
AVG(search_duration_ms) as avg_search_time,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY search_duration_ms) as p95_search_time,
AVG(result_count) as avg_results_returned,
-- Quality metrics
AVG(avg_similarity_score) as avg_result_relevance,
COUNT(*) FILTER (WHERE avg_similarity_score >= 0.8) as high_relevance_searches,
COUNT(*) FILTER (WHERE result_count = 0) as zero_result_searches,
-- User interaction metrics
COUNT(DISTINCT user_id) as unique_users,
AVG(user_interaction_score) as avg_user_satisfaction
FROM search_logs
WHERE search_timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', search_timestamp), search_type
),
embedding_performance AS (
SELECT
DATE_TRUNC('hour', created_at) as hour_bucket,
embedding_model,
-- Embedding metrics
COUNT(*) as embeddings_generated,
AVG(embedding_generation_time_ms) as avg_embedding_time,
AVG(ARRAY_LENGTH(embedding, 1)) as avg_dimensions -- Vector dimension validation
FROM documents_with_vectors
WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY DATE_TRUNC('hour', created_at), embedding_model
)
SELECT
sp.hour_bucket,
sp.search_type,
-- Volume metrics
sp.search_count,
sp.unique_users,
ROUND((sp.search_count::DECIMAL / sp.unique_users)::NUMERIC, 2) as searches_per_user,
-- Performance metrics
ROUND(sp.avg_search_time::NUMERIC, 2) as avg_search_time_ms,
ROUND(sp.p95_search_time::NUMERIC, 2) as p95_search_time_ms,
sp.avg_results_returned,
-- Quality metrics
ROUND(sp.avg_result_relevance::NUMERIC, 3) as avg_relevance_score,
ROUND((sp.high_relevance_searches::DECIMAL / sp.search_count * 100)::NUMERIC, 1) as high_relevance_rate_pct,
ROUND((sp.zero_result_searches::DECIMAL / sp.search_count * 100)::NUMERIC, 1) as zero_results_rate_pct,
-- User satisfaction
ROUND(sp.avg_user_satisfaction::NUMERIC, 2) as user_satisfaction_score,
-- Embedding performance (when available)
ep.embeddings_generated,
ep.avg_embedding_time,
-- Health indicators
CASE
WHEN sp.avg_search_time <= 100 AND sp.avg_result_relevance >= 0.7 THEN 'healthy'
WHEN sp.avg_search_time <= 500 AND sp.avg_result_relevance >= 0.5 THEN 'acceptable'
ELSE 'needs_attention'
END as system_health_status,
-- Recommendations
CASE
WHEN sp.zero_result_searches::DECIMAL / sp.search_count > 0.1 THEN 'Improve embedding coverage'
WHEN sp.avg_search_time > 1000 THEN 'Optimize vector indexes'
WHEN sp.avg_result_relevance < 0.6 THEN 'Review similarity thresholds'
ELSE 'Performance within targets'
END as optimization_recommendation
FROM search_performance sp
LEFT JOIN embedding_performance ep ON sp.hour_bucket = ep.hour_bucket
ORDER BY sp.hour_bucket DESC, sp.search_type;
-- QueryLeaf provides comprehensive Atlas Vector Search capabilities:
-- 1. SQL-familiar vector search syntax with similarity functions
-- 2. Advanced hybrid search combining vector and full-text capabilities
-- 3. Built-in RAG pipeline construction with context retrieval and ranking
-- 4. Personalized recommendation systems with user preference integration
-- 5. Real-time analytics and performance monitoring for vector operations
-- 6. Automatic embedding management and vector index optimization
-- 7. Conversational AI support with context-aware search capabilities
-- 8. Production-scale vector search with filtering and metadata integration
-- 9. Comprehensive search quality metrics and optimization recommendations
-- 10. Native integration with MongoDB Atlas Vector Search infrastructure
Best Practices for Atlas Vector Search Implementation
Vector Index Design and Optimization
Essential practices for production Atlas Vector Search deployments:
- Vector Dimensionality: Choose embedding dimensions based on model requirements and performance constraints
- Similarity Metrics: Select appropriate similarity functions (cosine, euclidean, dot product) for your use case
- Index Configuration: Configure vector indexes with optimal numCandidates and filter field selections (see the index definition sketch after this list)
- Metadata Strategy: Design metadata schemas that enable efficient filtering during vector search
- Embedding Quality: Implement embedding generation strategies that capture semantic meaning effectively
- Performance Monitoring: Deploy comprehensive monitoring for search latency, accuracy, and user satisfaction
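As a concrete illustration of the dimensionality, similarity metric, and filter field choices above, the sketch below defines a tuned vector index; the collection name, field paths, and index name are assumptions, and the snippet presumes an async context with a connected db handle:
// Illustrative index definition reflecting the practices above (names are assumptions)
const tunedIndexDefinition = {
  fields: [
    {
      type: 'vector',
      path: 'embedding',
      numDimensions: 1536,      // must match the embedding model's output size
      similarity: 'dotProduct'  // suitable for normalized embeddings; use 'cosine' otherwise
    },
    // Only declare filter fields that queries actually use
    { type: 'filter', path: 'metadata.category' },
    { type: 'filter', path: 'metadata.language' }
  ]
};
await db.collection('documents').createSearchIndex({
  name: 'tuned_vector_index',
  type: 'vectorSearch',
  definition: tunedIndexDefinition
});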
Production AI Application Patterns
Optimize Atlas Vector Search for real-world AI applications:
- Hybrid Search: Combine vector similarity with traditional search for comprehensive results
- RAG Optimization: Implement context selection strategies that balance relevance and diversity
- Real-time Updates: Design pipelines for incremental embedding updates and index maintenance (see the change stream sketch after this list)
- Personalization: Build user preference models that enhance search relevance
- Cost Management: Optimize embedding generation and storage costs through intelligent caching
- Security Integration: Implement proper authentication and access controls for vector data
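For the real-time update pattern in particular, one workable approach is a change stream that re-embeds documents whenever their text changes. The sketch below is a simplified outline under that assumption; the embed() helper and the collection name are hypothetical:
// Hedged sketch: keep embeddings current by watching for content changes
async function watchForContentChanges(db, embed) {
  const documents = db.collection('documents');
  const changeStream = documents.watch(
    [{ $match: { operationType: { $in: ['insert', 'update', 'replace'] } } }],
    { fullDocument: 'updateLookup' }
  );
  for await (const change of changeStream) {
    // Skip updates that only touched the embedding itself to avoid reprocessing our own writes
    if (change.operationType === 'update') {
      const updatedFields = Object.keys(change.updateDescription?.updatedFields || {});
      if (!updatedFields.some(field => field === 'title' || field === 'content')) continue;
    }
    const doc = change.fullDocument;
    if (!doc) continue;
    // Re-embed the changed document and store the new vector in place
    const [embedding] = await embed([`${doc.title}\n\n${doc.content}`]);
    await documents.updateOne(
      { _id: doc._id },
      { $set: { embedding, embedding_created_at: new Date() } }
    );
  }
}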
Conclusion
MongoDB Atlas Vector Search provides a comprehensive platform for building modern AI applications that require sophisticated semantic search capabilities. By integrating vector search directly into MongoDB's document model, developers can build powerful AI systems without the complexity of managing separate vector databases.
Key Atlas Vector Search benefits include:
- Native Integration: Seamless combination of document operations and vector search in a single platform
- Scalable Architecture: Built on MongoDB Atlas infrastructure with automatic scaling and management
- Hybrid Capabilities: Advanced search patterns combining vector similarity with traditional text search
- AI-Ready Features: Built-in support for RAG pipelines, personalization, and conversational AI
- Production Optimized: Enterprise-grade security, monitoring, and performance optimization
- Developer Friendly: Familiar MongoDB query patterns extended with vector search capabilities
Whether you're building recommendation systems, semantic search engines, RAG-powered chatbots, or other AI applications, MongoDB Atlas Vector Search with QueryLeaf's SQL-familiar interface provides the foundation for modern AI-powered applications that scale efficiently and maintain high performance.
QueryLeaf Integration: QueryLeaf automatically manages MongoDB Atlas Vector Search operations while providing SQL-familiar syntax for semantic search, hybrid search patterns, and RAG pipeline construction. Advanced vector search capabilities, personalization systems, and AI application patterns are seamlessly accessible through familiar SQL constructs, making sophisticated AI development both powerful and approachable for SQL-oriented teams.
The combination of MongoDB's flexible document model and advanced vector search capabilities makes it an ideal platform for AI applications that require both semantic understanding and operational flexibility. Your AI systems can evolve with advancing technology while maintaining familiar development patterns.