Building a Respectful Web Scraper: Lessons from Scraping Meetup Events
Web scraping often gets a bad reputation in the developer community—and for good reason. Too many scrapers are built without consideration for the target site’s resources, terms of service, or basic web etiquette. When I needed to gather event data from Meetup.com for a personal project, I decided to build a scraper that would be both effective and respectful. Here’s what I learned along the way.
The Problem: Dynamic Content and Infinite Scroll
Meetup’s event search pages present several interesting technical challenges that make them a perfect case study for modern web scraping:
- JavaScript-heavy rendering - The content is dynamically generated, so traditional HTTP requests won’t work
- Infinite scroll loading - Events load progressively as you scroll down
- Complex DOM structure - Event data is scattered across multiple nested elements
- Anti-bot measures - The site has various detection mechanisms in place
Architecture Overview
I built the scraper using Python with Selenium WebDriver, structured as a single class that handles all aspects of the scraping process:
```python
class MeetupScraper:
    def __init__(self):
        self.setup_driver()
        self.check_robots_txt()

    def scrape_events(self, url: str, max_pages: int = 3, exhaustive: bool = False):
        # Main scraping logic
        ...

    def extract_event_info(self, event_element):
        # Parse individual event data
        ...
```
The beauty of this approach is its simplicity—everything is contained within a single, focused class that can be easily tested and extended.
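Driving the class end to end looks roughly like this (a minimal sketch: the search URL is elided, and the exact constructor and method signatures may differ slightly from the repo):

```python
scraper = MeetupScraper()

# Collect up to three "pages" of infinite-scroll results from a Meetup search URL
events = scraper.scrape_events("https://www.meetup.com/...", max_pages=3)

print(f"Collected {len(events)} events")
```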
Designing for Pipeline Integration
This scraper wasn’t built as a standalone tool—it’s designed to be the first component in an automated email digest pipeline. The key insight here is creating a clean contract between pipeline stages through well-structured JSON output.
By outputting standardized JSON, the scraper establishes a clear boundary between data collection and data processing. The downstream emailer container doesn’t need to know anything about Selenium, web scraping challenges, or DOM parsing—it simply reads structured event data and focuses on its own responsibilities: AI summarization and email delivery.
This separation makes perfect sense for containerized deployments. Web scraping has unique requirements—headless browsers, anti-detection measures, and significant memory usage. Email generation has different needs—API integrations, template processing, and SMTP handling. By keeping them in separate containers, each can be optimized for its specific task and scaled independently.
The clean JSON contract also enables easier testing and development. You can mock the scraper’s output to test the emailer, or run the scraper standalone to validate data collection without triggering email sends.
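A minimal sketch of that contract, assuming the scraper serializes its results to a shared events.json file (the filename and the downstream function are illustrative, not the pipeline's actual interface):

```python
import json

# Scraper container: serialize collected events for the next stage
with open("events.json", "w") as f:
    json.dump(events, f, indent=2)

# Emailer container: read the same file with no knowledge of Selenium or the DOM
with open("events.json") as f:
    events = json.load(f)

digest = summarize_and_render(events)  # hypothetical downstream step
```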
Respecting robots.txt: The Right Way to Start
One of the first things my scraper does is check the target site’s robots.txt file:
```python
import logging
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

logger = logging.getLogger(__name__)

def check_robots_txt(self, url: str) -> bool:
    try:
        parsed_url = urlparse(url)
        robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"

        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()

        can_fetch = rp.can_fetch(self.user_agent, url)
        if not can_fetch:
            logger.warning(f"Scraping not allowed for {url}")
            return False

        # Respect crawl delays
        crawl_delay = rp.crawl_delay(self.user_agent)
        if crawl_delay:
            time.sleep(crawl_delay)

        return True
    except Exception as e:
        logger.error(f"Error checking robots.txt: {str(e)}")
        return False
```
This isn’t just about being polite—it’s about being professional. If a site explicitly disallows scraping in their robots.txt, my scraper respects that and exits gracefully.
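In practice that means bailing out before any page is loaded; a small sketch of the guard (its exact placement and the log message are my assumptions):

```python
# At the top of scrape_events, before the first page load
if not self.check_robots_txt(url):
    logger.info(f"Skipping {url}: disallowed by robots.txt")
    return []
```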
Handling Dynamic Content with Selenium
For JavaScript-heavy sites like Meetup, Selenium is often the best choice despite its overhead. The key is configuring it properly to avoid detection while maintaining good performance:
```python
from selenium.webdriver.chrome.options import Options

def setup_driver(self):
    chrome_options = Options()

    # Essential options for headless operation
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')

    # Anti-detection measures
    chrome_options.add_argument('--disable-blink-features=AutomationControlled')
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    # Custom user agent
    chrome_options.add_argument(f'--user-agent={self.user_agent}')
```
I also use Chrome DevTools Protocol (CDP) commands to further mask the automation:
```python
self.driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': '''
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    '''
})
```
The Infinite Scroll Challenge
Meetup’s infinite scroll implementation required a careful balance between thoroughness and performance. My solution tracks processed event IDs to avoid duplicates and implements both limited and exhaustive scraping modes:
```python
import random
from selenium.webdriver.common.by import By

def scrape_events(self, url: str, max_pages: int = 3, exhaustive: bool = False):
    events = []
    processed_ids = set()
    page = 0

    while True:
        if not exhaustive and page >= max_pages:
            break

        # Scroll and wait for content
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.uniform(2, 4))

        # Find new events
        event_elements = self.driver.find_elements(By.CSS_SELECTOR, '[data-event-id]')
        found_new_events = False

        for event_element in event_elements:
            event_id = event_element.get_attribute('data-event-id')
            if event_id not in processed_ids:
                # Process new event
                event_info = self.extract_event_info(event_element)
                if event_info:
                    events.append(event_info)
                processed_ids.add(event_id)
                found_new_events = True

        # Stop if no new events found
        if not found_new_events:
            break

        page += 1

    return events
```
The random delays between actions (`random.uniform(2, 4)`) help mimic human behavior and reduce server load.
Robust Data Extraction
The event data extraction logic handles the messiness of real-world HTML gracefully:
```python
from selenium.common.exceptions import NoSuchElementException

def extract_event_info(self, event_element):
    try:
        # Required fields
        event_id = event_element.get_attribute('data-event-id')
        title = event_element.find_element(By.CSS_SELECTOR, 'h3').text

        # Optional fields with fallbacks
        try:
            rating_container = event_element.find_element(By.CSS_SELECTOR, '[class*="text-ds-neutral500"]')
            rating = rating_container.find_element(By.CSS_SELECTOR, 'span').text
        except NoSuchElementException:
            rating = "No rating"

        # More extraction logic...

        return {
            'event_id': event_id,
            'title': title,
            'rating': rating,
            # ...remaining fields
        }
    except Exception as e:
        logger.error(f"Error extracting event info: {str(e)}")
        return None
```
Every optional field has a fallback value, ensuring the scraper doesn’t crash on edge cases while still collecting as much data as possible.
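One way to keep that pattern from turning into a pile of nested try/except blocks is a small lookup helper; this is a sketch of the idea rather than code from the repo (the helper name and the placeholder selector are mine):

```python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def safe_text(self, parent, selector: str, default: str = "Unknown") -> str:
    """Return the text of the first element matching selector, or a fallback."""
    try:
        return parent.find_element(By.CSS_SELECTOR, selector).text
    except NoSuchElementException:
        return default

# Inside extract_event_info, optional fields then become one-liners, e.g.:
# location = self.safe_text(event_element, '<location selector>', 'Location TBD')
```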
Docker Support for Consistent Deployment
Running Selenium in containers can be tricky, but it’s worth it for consistent deployments. The key insight is detecting the environment and configuring the driver accordingly:
```python
import os
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

if os.path.exists('/usr/bin/chromedriver'):
    # Docker environment
    chrome_options.binary_location = '/usr/bin/chromium-browser'
    service = Service('/usr/bin/chromedriver')
else:
    # Local environment
    service = Service(ChromeDriverManager().install())
```
This allows the same code to run both locally (with automatic ChromeDriver management) and in Docker containers.
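The tail end of setup_driver then ties the two paths together with the same call; roughly (a sketch, with the page-load timeout being my own addition rather than a value from the repo):

```python
from selenium import webdriver

# Same constructor for both environments once `service` and `chrome_options` are built
self.driver = webdriver.Chrome(service=service, options=chrome_options)
self.driver.set_page_load_timeout(30)  # assumed safeguard so a hung page can't stall the pipeline
```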
Performance and Rate Limiting
The scraper includes several mechanisms to be respectful of Meetup’s servers:
- Random delays between actions (1-4 seconds)
- Robots.txt compliance including crawl delay respect
- Proper user agent identification
- Graceful error handling that doesn’t retry aggressively
- Configurable limits on scraping depth
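Most of these boil down to pausing at the right moments. A sketch of how the crawl delay and the random jitter might be combined into one helper (the method name and bounds are illustrative):

```python
import random
import time

def polite_pause(self, crawl_delay=None):
    # Honor a robots.txt crawl-delay if one was advertised, then add human-like jitter
    base = crawl_delay or 0
    time.sleep(base + random.uniform(1, 4))
```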
Real-World Results
The scraper successfully extracts comprehensive event data:
```json
[
  {
    "event_id": "307937022",
    "title": "Python x OpenAI Workshop",
    "url": "https://www.meetup.com/...",
    "date": "2025-06-14T09:30:00-05:00",
    "date_display": "Sat, Jun 14 · 9:30 AM CDT",
    "location": "Online",
    "group_name": "Tech Founders Club",
    "rating": "4.5",
    "attendees": 46,
    "image_url": "https://secure.meetupstatic.com/..."
  }
]
```
Lessons Learned
- Respect comes first - Always check robots.txt and implement reasonable rate limiting
- Flexibility beats perfection - Handle missing data gracefully rather than failing completely
- Environment awareness - Build for both local development and production deployment
- Logging is crucial - Good logging makes debugging production issues much easier
- Random delays work - They help avoid detection and reduce server load
- Design for pipelines - Clean JSON output makes integration with downstream systems seamless
Looking Forward
Web scraping doesn’t have to be a dark art. With the right approach, you can build tools that are both effective and respectful—exactly what the web needs more of.
The complete source code for this scraper is available on GitHub. This scraper is part of a larger automated email digest pipeline. Remember to always respect websites’ terms of service and rate limits when building your own scrapers.