
Consistent Hashing & Design Tiny URL

Consistent Hashing (Sharding)

  1. Basic Sharding: Integer Modulo X % N.
    Inconsistent. The remainder changes drastically when N changes, creating heavy data-migration pressure.
  2. Consistent Hashing
    1. Basic Solution:
      1. Divide the circle into 2^64 points.
      2. Evenly distribute the machines on the circle.
      3. Calculate the data's point on the circle from its data-id (stable across lookups).
      4. A data request finds the next available machine clockwise along the circle.
    2. Improved Solution (Micro-shards / Virtual Nodes)
      1. Map the N machines to 1000N virtual nodes (1000 per machine).
      2. Randomly and evenly distribute the virtual nodes on the circle.
      3. A data request finds the next available virtual node (via binary search) and routes to its corresponding machine.
    3. Implementation:
      1. Only the machine points need to be stored, not the data points.
      2. Use a TreeMap (red-black tree) to find the next available machine point in O(log N) time.
    4. Maintenance:
      1. Add a new machine.
        1. Randomly add 1000 virtual nodes for it. Each new virtual node asks its next available virtual node (clockwise) to migrate over the data it now owns.
      2. Delete an old machine.
        1. Migrate the data to the next available virtual nodes.
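The ring above can be sketched in Python. This is a minimal sketch, not a production implementation: the class and method names and the MD5-based ring hash are my assumptions, and a sorted list with binary search stands in for the TreeMap.

```python
import bisect
import hashlib

def ring_point(key: str) -> int:
    # Map any string to a point on the 2^64 ring (assumption: MD5 prefix).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, vnodes_per_machine: int = 1000):
        self.vnodes = vnodes_per_machine
        self.points = []   # sorted virtual-node positions (TreeMap stand-in)
        self.owner = {}    # position -> physical machine

    def add_machine(self, machine: str) -> None:
        # Introduce 1000 virtual nodes for this machine at "random" points.
        for i in range(self.vnodes):
            p = ring_point(f"{machine}#vnode{i}")
            bisect.insort(self.points, p)
            self.owner[p] = machine

    def remove_machine(self, machine: str) -> None:
        # Drop all of this machine's virtual nodes from the ring.
        self.points = [p for p in self.points if self.owner[p] != machine]
        self.owner = {p: m for p, m in self.owner.items() if m != machine}

    def locate(self, data_id: str) -> str:
        # Binary search for the next virtual node clockwise: O(log N).
        i = bisect.bisect_right(self.points, ring_point(data_id)) % len(self.points)
        return self.owner[self.points[i]]
```

Because only machine (virtual-node) points are stored, `locate` never persists anything about the data itself; the same `data_id` always hashes to the same ring point between lookups.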

Replica

  1. Backup vs Replica
    1. Backup: offline / periodic / doesn't serve reads (runs off-peak, cheap)
    2. Replica: online / real-time / serves reads (runs alongside production, expensive)
  2. MySQL Replica: Master-slave
    1. Master write. Slave read. Slave sync from the master.
    2. (Comes with the feature) Write ahead log
      1. MySQL logs every write operation.
      2. When a slave comes online, it informs the master.
      3. The master notifies the slave to read the log. (The log is synced, not the values; the user only waits for the write to the master, not the writes to the slaves.)
      4. The slaves therefore always lag slightly behind.
      5. If the master dies, a slave is promoted to master, with slight data loss and inconsistency.
  3. NoSQL: Find three virtual nodes in the consistent hashing ring and write three times.
  4. Difference:
    1. SQL:
      1. Auto Replica – Master-Slave
      2. Manual Replica – Consistent Hashing triple stores
    2. NoSQL:
      1. Auto Replica – Consistent Hashing triple stores
      2. Manual Replica – Not necessary
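The NoSQL triple-write above can be sketched as a walk on the consistent-hashing ring: start at the data's position and collect the first three *distinct* physical machines, skipping further virtual nodes of machines already chosen. A minimal sketch; the function names and the MD5 ring hash are assumptions.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Map a string to a point on the 2^64 ring (assumption: MD5 prefix).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def replica_machines(points, owner, data_id, n=3):
    """Walk clockwise from the data's ring position and return the first
    n distinct physical machines (so the n copies never share a machine).

    points: sorted virtual-node positions; owner: position -> machine."""
    start = bisect.bisect_right(points, ring_hash(data_id))
    chosen = []
    for k in range(len(points)):
        m = owner[points[(start + k) % len(points)]]
        if m not in chosen:
            chosen.append(m)
            if len(chosen) == n:
                break
    return chosen
```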

Design Tiny URL

  1. Workflow
    1. Request to tiny URL service
    2. Redirect to target website
    3. Go to target website
  2. Scenario
    1. 100M DAU
    2. Publish QPS = 100M * 0.1 / 86,400 ≈ 100; peak ≈ 200
    3. Read QPS = 100M * 1 / 86,400 ≈ 1,000; peak ≈ 2k
    4. Storage: 100M * 0.1 * 100 bytes = 1 GB per day
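The estimates above as plain arithmetic, assuming 100M DAU, 0.1 publishes and 1 read per user per day, and ~100 bytes per stored record:

```python
DAU = 100_000_000
SECONDS_PER_DAY = 86_400

publish_qps = DAU * 0.1 / SECONDS_PER_DAY   # ~116, rounded to ~100 (peak ~200)
read_qps = DAU * 1 / SECONDS_PER_DAY        # ~1,157, rounded to ~1k (peak ~2k)
daily_storage_gb = DAU * 0.1 * 100 / 1e9    # 1 GB of new records per day

print(round(publish_qps), round(read_qps), daily_storage_gb)
```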
  3. Service
    1. UrlService.encode -> POST (create a new short URL)
    2. UrlService.decode -> GET (look up and redirect)
  4. Storage:
    1. No Transaction
    2. No SQL Features
    3. No complex queries
    4. No sequential ID needed (MySQL provides auto-increment IDs; NoSQL generally doesn't)
  5. Algorithm:
    1. Hash function.
      1. Pros: Fast
      2. Cons: Conflict
    2. Random generation. Remove duplicates.
      1. Pros: Easy implementation
      2. Cons: Performance degrades as the number of stored URLs grows.
      3. Implementation: ShortKey, LongUrl. Double Index. (NoSQL, row_key + column_key)
    3. Convert a global ID to Base62
      1. Base62 uses 62 characters (0-9, a-z, A-Z).
      2. Pros: Efficient
      3. Cons: Relies on a globally incremental ID. (SQL)
      4. Implementation: SQL
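The Base62 conversion can be sketched as follows (function names are assumptions; the alphabet order 0-9, a-z, A-Z matches the note above):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode62(n: int) -> str:
    """Convert a global auto-increment ID into a short Base62 key."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode62(s: str) -> int:
    """Convert a short key back to the numeric ID for the database lookup."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Six Base62 digits already cover 62^6 ≈ 56.8 billion IDs, far more than 10M new URLs per day would consume for years.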
  6. Scale:
    1. Performance:
      1. Cache
        1. Cache aside.
        2. Look up the shortUrl in the cache; on a miss, read it from the database and populate the cache.
      2. Location-Based
        1. Server:
          1. DNS matches users with servers at different locations
        2. Database:
          1. Centralized database but distributed Memcached.
          2. The web server reaching the database server is faster than the user reaching the database server directly.
      3. Multiple Database:
        1. Purpose: Solve QPS problem. Not storage problem.
        2. Horizontal Sharding.
          1. Problem:
            If we shard by the short URL, how do we find which shard holds a given long URL?
            If we shard by the long URL, how do we find which shard holds a given short URL?
          2. Solution:
            1. Duplicate
              1. S -> L. S as sharding key.
              2. L -> S. L as sharding key.
            2. One long URL maps to multiple short URLs
            3. Extend the short URL.
              1. Prepend the long URL's sharding key as one extra digit.
              2. Use 62 virtual nodes for sharding.
      4. Multi-Region
        1. Sharding based on the location of the service.
        2. Store hot short URLs in the nearby database server. – Training
    2. Timeout
      1. Better not.
      2. If need to timeout, use timeout service.
    3. Custom URL
      1. Create a custom URL table. Mix with a random short URL.
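The "extend the short URL" sharding scheme above can be sketched as follows. This is a sketch under stated assumptions: 62 shards (one per Base62 digit), MD5 as the stable hash of the long URL, and all function names are hypothetical.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
NUM_SHARDS = 62  # assumption: one shard per Base62 digit

def shard_of_long(long_url: str) -> int:
    # Pick the shard from the long URL, so repeated submissions of the
    # same URL always land on the same shard (any stable hash works).
    return int(hashlib.md5(long_url.encode()).hexdigest(), 16) % NUM_SHARDS

def make_short_key(long_url: str, key_within_shard: str) -> str:
    # Prepend the shard id as one extra Base62 digit.
    return ALPHABET[shard_of_long(long_url)] + key_within_shard

def shard_of_short(short_key: str) -> int:
    # Reads route on the first digit alone -- no cross-shard lookup needed.
    return ALPHABET.index(short_key[0])
```

With this layout both directions are answerable from the key itself: writes hash the long URL to choose a shard, and reads recover the same shard from the short key's first character.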

 
