Design Web Crawler
- Example:
- Google Search.
- Directed Graph.
- Vertices are web pages; edges are the URLs (links) inside a page.
- BFS traversal (BFS is easy to parallelize across many crawlers).
- Scenario:
- 1 Trillion web pages
- Crawl all of them every week -> Crawl 1.6M web pages per second
- The average size of a web page is 10 KB
- 1 Trillion pages * 10 KB ≈ 10 PB of web page storage
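The scenario numbers above can be checked with quick back-of-envelope arithmetic (10 KB average page size as assumed above):

```python
PAGES = 10**12                 # 1 trillion web pages
WEEK_SECONDS = 7 * 86400       # crawl everything once a week
AVG_PAGE_BYTES = 10_000        # assumed 10 KB average page size

crawl_rate = PAGES / WEEK_SECONDS               # pages to fetch per second
storage_pb = PAGES * AVG_PAGE_BYTES / 10**15    # total storage in petabytes

print(f"{crawl_rate:,.0f} pages/s")   # roughly 1.65M pages per second
print(f"{storage_pb:.0f} PB")         # 10 PB
```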
- Service
- Basic Workflow:
- Fetch web page.
- Regex match
- News Service: Regex match to get headlines, e.g. <h3[^>]*><a[^>]*>(.*?)</a></h3>
- Search Service: Regex match to get URLs.
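A minimal sketch of both extraction steps using the headline pattern above; the HTML snippet and the href pattern for the search service are illustrative assumptions:

```python
import re

html = ('<h3 class="title"><a href="https://example.com/a">Breaking news</a></h3>'
        '<p><a href="https://example.com/b">other link</a></p>')

# News service: extract headline text inside <h3><a>...</a></h3>.
headlines = re.findall(r'<h3[^>]*><a[^>]*>(.*?)</a></h3>', html)

# Search service: extract all href URLs to feed back into the crawl queue.
urls = re.findall(r'<a[^>]*href="([^"]+)"', html)

print(headlines)   # ['Breaking news']
print(urls)        # ['https://example.com/a', 'https://example.com/b']
```

In practice a real HTML parser is more robust than regex, but regex keeps the workflow sketch simple.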
- Single Thread Single Machine (BFS)
- Maintain a URL queue + HashSet (to remove duplicates)
- Web page loader pops URL from the front of the queue.
- URL extractor pushes multiple URLs to the end of the queue.
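The single-machine BFS loop above might look like the following; `fetch` and `extract_urls` are assumed stand-ins for the page loader and URL extractor:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls, limit=1000):
    """Single-threaded BFS crawl: a FIFO queue holds the frontier and a
    HashSet drops duplicate URLs, exactly as described above."""
    queue = deque(seed_urls)
    visited = set(seed_urls)
    pages = []
    while queue and len(pages) < limit:
        url = queue.popleft()            # loader pops from the front
        page = fetch(url)
        pages.append((url, page))
        for link in extract_urls(page):
            if link not in visited:      # HashSet removes duplicates
                visited.add(link)
                queue.append(link)       # extractor pushes to the back
    return pages
```

Usage with a toy link graph: `crawl(["a"], fetch=lambda u: u, extract_urls=lambda p: graph[p])` visits pages in BFS order.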
- Multi-Thread Single Machine
- Use consumer and producer queue.
- Sleep
- Condition variable
- Semaphore
- CV and Semaphore
- They are built on mutual exclusion provided by locks.
- The condition variable solves busy waiting: a thread that must wait is parked on an OS-managed wait queue instead of spinning.
- Boolean.
- wait + signal / broadcast
- The semaphore handles the more complicated case with a counter. It is essentially an integer counter plus a wait queue controlled by the OS.
- Integer
- down (P / wait) + up (V / signal)
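The wait/signal pattern above can be sketched as a bounded producer/consumer queue built on a lock plus two condition variables (capacity and names are illustrative):

```python
import threading
from collections import deque

class BoundedQueue:
    """Producer/consumer queue: waiters sleep on an OS-managed wait
    queue (condition variable) instead of busy-waiting."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        lock = threading.Lock()
        self.not_full = threading.Condition(lock)
        self.not_empty = threading.Condition(lock)

    def put(self, item):
        with self.not_full:
            while len(self.items) >= self.capacity:   # re-check guard after wakeup
                self.not_full.wait()
            self.items.append(item)
            self.not_empty.notify()                   # signal one consumer

    def get(self):
        with self.not_empty:
            while not self.items:
                self.not_empty.wait()
            item = self.items.popleft()
            self.not_full.notify()                    # signal one producer
            return item
```

The `while` (not `if`) around each `wait()` is essential: a woken thread must re-check the guard because another thread may have run first.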
- Multiple crawler threads using multiple TCP ports.
- Limitation on # of crawlers.
- Overhead: context switch cost, memory, etc.
- Number of ports (2 ^ 16 = 65536). Not really limiting.
- Network bottleneck for a single machine.
- Multi-Machine
- Using DB to store the queue (Task table, MySQL)
- Schema: ID, URL, State {idle, working, done}, Priority, Available Time (Control the frequency of fetching)
- Each machine asks for 1K URLs from the task table.
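A sketch of how one machine could atomically claim a batch from the task table, following the schema above. SQLite is used here for a self-contained example; a real MySQL deployment would use `SELECT ... FOR UPDATE` so that concurrent machines never grab the same rows:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE task (
    id INTEGER PRIMARY KEY,
    url TEXT,
    state TEXT DEFAULT 'idle',         -- idle / working / done
    priority INTEGER DEFAULT 0,
    available_time INTEGER DEFAULT 0   -- epoch seconds; controls fetch frequency
)""")
conn.executemany("INSERT INTO task (url) VALUES (?)",
                 [(f"http://example.com/{i}",) for i in range(10)])

def claim_batch(conn, n=1000):
    """Mark up to n due, idle tasks as 'working' and return them,
    all inside one transaction."""
    now = int(time.time())
    with conn:   # one transaction: select + update commit together
        rows = conn.execute(
            "SELECT id, url FROM task WHERE state='idle' AND available_time<=? "
            "ORDER BY priority DESC LIMIT ?", (now, n)).fetchall()
        conn.executemany("UPDATE task SET state='working' WHERE id=?",
                         [(row[0],) for row in rows])
    return rows

batch = claim_batch(conn, n=3)
```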
- Storage Service (NoSQL / File System)
- Scale
- Task table is large and access is slow
- Sharding
- Frequency of crawl
- Exponential back-off
- If the page is unchanged since the last crawl, double the crawl interval; if it changed, halve it.
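The back-off rule above as a one-liner; the interval bounds (1 hour to 30 days) are assumed tuning values:

```python
def next_interval(interval, changed, lo=3600, hi=30 * 86400):
    """Exponential back-off on crawl frequency: halve the re-crawl
    interval when the page changed since the last fetch, double it
    when it did not, clamped to [lo, hi] seconds (assumed bounds)."""
    interval = interval / 2 if changed else interval * 2
    return max(lo, min(hi, interval))
```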
- Dead loop (crawler trap):
- All crawlers end up working on the same website.
- Manually limit the crawl resources allotted to specific websites.
- Multi-region:
- Process data within each region.
- Sync data globally and periodically.
Design Typeahead
- Example
- Google Suggestion
- Prefix -> Top N Hot Keywords
- Twitter Typeahead
- Suggestion + User + Hashtag
- Scenario
- DAU: 500M
- Search: (each user searches 6 times per day, typing ~4 letters per search) 6 * 4 * 500M = 12B requests/day
- QPS: 12B / 86400 = 138K
- Peak QPS: 138K * 2 = 276K
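The estimate above, spelled out (the 6 searches/day, 4 keystrokes, and 2x peak factor are the assumptions stated in the scenario):

```python
DAU = 500_000_000
searches_per_user = 6
keystrokes_per_search = 4   # each typed letter fires one typeahead request

daily_requests = DAU * searches_per_user * keystrokes_per_search  # 12B/day
qps = daily_requests / 86400
peak_qps = qps * 2
```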
- Service
- Query Service + Data Collection Service
- Structure
- Request -> Trie (In Memory) + Serialized Trie (On Disk) <- Data Collection Service <- Log Data (Database)
- Query Solution
- SQL query with like syntax and order syntax
- SQL query with a prefix as key and the hot words as value
- Data Collection Solution
- User + Keyword + Timestamp
- Storage
- Database (Sharding) + Memcached
- In-memory Trie
- Trie: At each node, store the frequency of the word
- Trie: At each node, store the hot words with the prefix
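A sketch of the second trie variant above, where every node caches the top-N hot words sharing its prefix so a query is a single walk plus an O(1) list return (class and method names are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []   # precomputed top-N (freq, word) pairs for this prefix

class Typeahead:
    """Trie where each node stores the hot words under its prefix."""
    def __init__(self, n=10):
        self.root = TrieNode()
        self.n = n

    def add(self, word, freq):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((freq, word))
            node.top.sort(reverse=True)    # hottest first
            del node.top[self.n:]          # keep only the top N

    def query(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [word for _, word in node.top]
```

This trades memory (each word is stored at every node along its path) for constant-time reads, which matches the read-heavy typeahead workload.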
- Scale
- Update
- Rebuild the trie on a standby server that is not serving user traffic.
- Then switch: the freshly updated server takes the traffic while the stale one is updated.
- Not Enough Memory
- Add multiple query service
- Divide based on the first character -> Strong bias
- Use consistent hashing (use the prefix as the key)
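A minimal consistent-hashing ring keyed on the prefix, as suggested above; the virtual-node count is an assumed tuning value:

```python
import bisect
import hashlib

class ConsistentHash:
    """Map keys (here: typeahead prefixes) to the first virtual node
    clockwise on a hash ring, so adding or removing a server only
    remaps a small slice of the key space."""
    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, prefix):
        i = bisect.bisect(self.keys, self._hash(prefix)) % len(self.ring)
        return self.ring[i][1]
```

Virtual nodes smooth out the load, avoiding the strong bias that first-character sharding suffers from.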
- Reduce the size of the log file
- User + Prefix + Timestamp
- Local Cache (Browser)
- Sampling
- Further performance improvement
- Local caching at browser (Javascript)
- Ask the server for more hot words than are displayed; the results for a longer prefix are then likely already in hand.
- Enter "a": ask for the top 1K words starting with "a".
- Enter "ab": the top words starting with "ab" are probably among the cached 1K, so no new request is needed.
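A sketch of that client-side trick (Python for illustration; the browser version would be JavaScript). `fetch` is an assumed stand-in for the server call:

```python
class PrefixCache:
    """Fetch a big batch of suggestions for a short prefix, then answer
    longer prefixes locally by filtering, skipping the round trip."""
    def __init__(self, fetch, batch=1000):
        self.fetch = fetch     # assumed server call: prefix -> hot word list
        self.batch = batch
        self.cache = {}        # cached_prefix -> word list

    def suggest(self, prefix, n=10):
        for cached, words in self.cache.items():
            if prefix.startswith(cached):
                hits = [w for w in words if w.startswith(prefix)]
                if hits:       # longer prefix answered from the cache
                    return hits[:n]
        words = self.fetch(prefix)[:self.batch]
        self.cache[prefix] = words
        return words[:n]
```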