Design Web Crawler & Typeahead

Design Web Crawler

  1. Example:
    1. Google Search.
      1. Directed Graph.
      2. Vertex as a web page. Edges as URLs in the web page.
      3. BFS search (In order to search in parallel).
  2. Scenario:
    1. 1 Trillion web pages
      1. Crawl all of them every week -> Crawl 1.6M web pages per second
    2. 10 Petabytes of web page storage
      1. The average size of a web page is 10 KB (1T pages * 10 KB ≈ 10 PB)
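These scenario numbers can be sanity-checked with a few lines of arithmetic (all figures come from the notes above):

```python
# Back-of-envelope check for the crawl scenario.
pages = 1_000_000_000_000            # 1 trillion web pages
week_seconds = 7 * 24 * 3600         # 604,800 seconds per week
crawl_rate = pages / week_seconds
print(round(crawl_rate / 1e6, 2))    # ~1.65M pages crawled per second

avg_page_bytes = 10 * 1000           # 10 KB average page size
storage_pb = pages * avg_page_bytes / 1e15
print(storage_pb)                    # 10.0 PB of raw page storage
```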
  3. Service
    1. Basic Workflow:
      1. Fetch web page.
      2. Regex match
        1. News Service: Regex match to get headlines, e.g. "<h3[^>]*><a[^>]*>(.*?)<\/a><\/h3>".
        2. Search Service: Regex match to get URLs.
    2. Single Thread Single Machine (BFS)
      1. Maintain a URL queue + HashSet (to remove duplicates)
      2. Web page loader pops URL from the front of the queue.
      3. URL extractor pushes multiple URLs to the end of the queue.
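The single-thread design above can be sketched in a few lines. This is a toy model, not a real crawler: `fetch` is a stub standing in for an HTTP client, `SAMPLE_PAGES` is made-up data, and the URL regex is deliberately simplified.

```python
# Single-thread BFS crawler: a FIFO queue of URLs plus a visited set.
import re
from collections import deque

URL_RE = re.compile(r'href="(http[^"]+)"')

SAMPLE_PAGES = {
    "http://a": '<a href="http://b">b</a> <a href="http://c">c</a>',
    "http://b": '<a href="http://c">c</a>',
    "http://c": "",
}

def fetch(url):
    # Stub: a real crawler would issue an HTTP GET here.
    return SAMPLE_PAGES.get(url, "")

def bfs_crawl(seed):
    queue = deque([seed])
    seen = {seed}                      # HashSet: drop duplicate URLs
    order = []
    while queue:
        url = queue.popleft()          # loader pops from the front
        order.append(url)
        for link in URL_RE.findall(fetch(url)):
            if link not in seen:       # extractor pushes new URLs to the back
                seen.add(link)
                queue.append(link)
    return order

print(bfs_crawl("http://a"))  # visits a, then b and c (BFS order)
```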
    3. Multi-Thread Single Machine
      1. Use consumer and producer queue.
        1. Sleep
        2. Condition variable
        3. Semaphore
        4. CV and Semaphore
          1. Both are built on the mutual exclusion provided by locks.
          2. The condition variable solves the busy-waiting problem. It is essentially a wait queue managed by the OS.
            1. Guards a Boolean condition.
            2. wait + signal / broadcast
          3. The semaphore handles the more general counting case. It is essentially a counter plus an OS-managed wait queue.
            1. Integer counter.
            2. up / down (i.e. signal / wait)
      2. Multiple crawler threads using multiple TCP ports.
        1. Limitation on # of crawlers.
          1. Overhead: context switch cost, memory, etc.
          2. Number of ports (2 ^ 16 = 65536). Not really limiting.
          3. Network bottleneck for a single machine.
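A minimal sketch of the multi-thread design: a producer/consumer URL queue built on a condition variable, so crawler threads block (no busy waiting) until there is work or space. The capacity of 100 is an arbitrary choice for illustration.

```python
# Producer/consumer URL queue guarded by a condition variable.
import threading
from collections import deque

class URLQueue:
    def __init__(self, capacity=100):
        self.q = deque()
        self.capacity = capacity
        self.cv = threading.Condition()   # OS-managed wait queue

    def put(self, url):                   # producer: URL extractor
        with self.cv:
            while len(self.q) >= self.capacity:
                self.cv.wait()            # sleep until a consumer signals
            self.q.append(url)
            self.cv.notify_all()

    def get(self):                        # consumer: web page loader
        with self.cv:
            while not self.q:
                self.cv.wait()            # sleep until a producer signals
            url = self.q.popleft()
            self.cv.notify_all()
            return url

q = URLQueue()
results = []
consumer = threading.Thread(target=lambda: results.append(q.get()))
consumer.start()                          # blocks: queue is empty
q.put("http://example.com")               # wakes the consumer
consumer.join()
print(results)
```

The `while` (rather than `if`) around each `wait()` re-checks the condition after waking, which is the standard guard against spurious wakeups.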
    4. Multi-Machine
      1. Using DB to store the queue (Task table, MySQL)
        1. Schema: ID, URL, State {idle, working, done}, Priority, Available Time (Control the frequency of fetching)
        2. Each machine asks for 1K URLs from the task table.
      2. Storage Service (NoSQL / File System)
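The task table can be sketched as follows, using sqlite3 so the example is self-contained; column and state names follow the notes, while the table name, batch size, and sample URLs are assumptions. A real MySQL deployment would claim rows atomically (e.g. with `SELECT ... FOR UPDATE` in a transaction) so two machines never grab the same URL.

```python
# Task table sketch: workers claim batches of idle, due URLs.
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE task (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,
    state TEXT DEFAULT 'idle',        -- idle | working | done
    priority INTEGER DEFAULT 0,
    available_time INTEGER DEFAULT 0  -- crawl no earlier than this
)""")
db.executemany("INSERT INTO task (url) VALUES (?)",
               [("http://a",), ("http://b",), ("http://c",)])

def claim(batch=2):
    """A worker machine asks for a batch of idle URLs that are due."""
    now = int(time.time())
    rows = db.execute(
        "SELECT id, url FROM task WHERE state='idle' AND available_time<=? "
        "ORDER BY priority DESC, id LIMIT ?", (now, batch)).fetchall()
    db.executemany("UPDATE task SET state='working' WHERE id=?",
                   [(r[0],) for r in rows])
    return [r[1] for r in rows]

print(claim())   # first batch of idle URLs
```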
  4. Scale
    1. Task table is large and access is slow
      1. Sharding
    2. Crawl frequency
      1. Exponential back-off
      2. If a page changed since the last crawl, halve its crawl interval; otherwise double it.
    3. Dead cycle:
      1. All crawlers end up working on the same website (a crawler trap).
      2. Manually limit the compute resources allocated to specific websites.
    4. Multi-region:
      1. Process data within each region.
      2. Sync data globally and periodically.
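The exponential back-off rule for crawl frequency can be sketched directly; the bounds and units below are assumptions, not from the notes:

```python
# Exponential back-off for re-crawl scheduling: halve the interval when the
# page changed since the last crawl, double it when it did not.
MIN_INTERVAL = 3600       # assumed floor: 1 hour
MAX_INTERVAL = 604800     # assumed ceiling: 1 week

def next_interval(interval, changed):
    if changed:
        return max(MIN_INTERVAL, interval // 2)   # crawl more often
    return min(MAX_INTERVAL, interval * 2)        # crawl less often

print(next_interval(86400, changed=True))    # daily page changed -> 12h
print(next_interval(86400, changed=False))   # unchanged -> every 2 days
```

The next crawl time would then be written into the task table's available_time column.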

Design Typeahead

  1. Example
    1. Google Suggestion
      1. Prefix -> Top N Hot Keywords
    2. Twitter Typeahead
      1. Suggestion + User + Hashtag
  2. Scenario
    1. DAU: 500M
    2. Search: (Each user searches 6 times, types 4 letters) 6 * 4 * 500M = 12B
    3. QPS: 12B / 86400 = 138K
    4. Peak QPS: 138K * 2 = 276K
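The same kind of arithmetic check works for the typeahead traffic estimate:

```python
# Sanity check of the typeahead QPS estimate.
dau = 500_000_000
daily_queries = dau * 6 * 4    # 6 searches/user/day, ~4 keystrokes each
qps = daily_queries / 86400
print(round(qps))              # ~139K average QPS; peak is taken as ~2x
```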
  3. Service
    1. Query Service + Data Collection Service
    2. Structure
      Request -> Trie (In Memory) + Serialized Trie (On Disk) <- Data Collection Service <- Log Data (Database)
    3. Query Solution
      1. SQL query with like syntax and order syntax
      2. Key-value lookup with the prefix as the key and the hot words as the value
    4. Data Collection Solution
      1. User + Keyword + Timestamp
  4. Storage
    1. Database (Sharding) + Memcached
      1. SQL query with like syntax and order syntax
      2. Key-value lookup with the prefix as the key and the hot words as the value
    2. In-memory Trie
      1. Trie: At each node, store the frequency of the word
      2. Trie: At each node, store the hot words with the prefix
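The second trie variant can be sketched as below: each node caches the top-N hot keywords under its prefix, so a query is a single walk down the trie plus an O(1) read. The words, frequencies, and N=3 are toy values.

```python
# Trie where every node stores the hottest words sharing its prefix.
class Node:
    def __init__(self):
        self.children = {}
        self.top = []          # [(freq, word)] hottest words for this prefix

def insert(root, word, freq, n=3):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
        node.top.append((freq, word))
        node.top.sort(reverse=True)      # keep only the n hottest
        node.top = node.top[:n]

def query(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [word for _, word in node.top]

root = Node()
for word, freq in [("apple", 50), ("app", 80), ("apply", 30), ("banana", 90)]:
    insert(root, word, freq)
print(query(root, "ap"))   # hottest first: app, apple, apply
```

The trade-off versus storing only per-node frequencies: queries avoid a subtree traversal, at the cost of extra memory and more work on insert.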
  5. Scale
    1. Update
      1. Update the trie on a server that users are not currently hitting.
      2. Then switch user traffic over to the updated server.
    2. Not Enough Memory
      1. Add multiple query services
        1. Divide based on the first character -> Strong bias
        2. Use consistent hashing (use the prefix as the key)
    3. Reduce the size of the log file
      1. User + Prefix + Timestamp
        1. Local Cache (Browser)
        2. Sampling
    4. Further performance improvement
      1. Local caching at browser (Javascript)
      2. Ask for more hot words up front: a longer prefix can then still be answered from the earlier response with high probability.
        1. Enter "a": ask the server for the 1K hottest words that start with "a".
        2. Enter "ab": serve the suggestions locally from those 1K words.
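The prefetch idea above can be sketched as follows (in Python rather than the browser-side JavaScript the notes mention). The `SERVER` dict is a stand-in for the real "top 1K per first letter" response:

```python
# Client-side cache: fetch the hot-word list for the first letter once,
# then answer longer prefixes locally with no extra network request.
SERVER = {"a": ["apple", "app", "able", "about"]}   # pretend top-1K list

class LocalCache:
    def __init__(self):
        self.cache = {}    # first letter -> prefetched hot words

    def suggest(self, prefix):
        first = prefix[0]
        if first not in self.cache:
            # One network round trip per first letter (simulated here).
            self.cache[first] = SERVER.get(first, [])
        return [w for w in self.cache[first] if w.startswith(prefix)]

c = LocalCache()
print(c.suggest("a"))    # fetched from the server
print(c.suggest("ab"))   # served from the local cache
```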
