Concepts
Snitch
Snitch is how the nodes in a cluster know about the topology of the cluster
PropertyFileSnitch: the topology is defined in a properties file (cassandra-topology.properties), which must be modified on every machine in the cluster
Gossip
Gossip is the internal communication method for nodes in a cluster to talk to each other
For external communication, such as from an application to a Cassandra database, CQL or Thrift are used.
Data Distribution
Data distribution is done through consistent hashing to strive for even distribution of data across the nodes in a cluster.
To distribute the row across the nodes, a partitioner is used.
The partitioner uses an algorithm to determine which node a given row of data will go to.
The default partitioner in Cassandra is Murmur3
Murmur3 takes the value in the first column* of the row and hashes it to a token, a 64-bit number between $$-2^{63}$$ and $$2^{63}-1$$
Each node in a cluster is assigned one token range (or multiple ranges with virtual nodes)
Calculate the token range
python
# Endpoint token for each of 4 evenly spaced nodes (Python 3; // keeps the math in integers)
print([str(((2**64 // 4) * i) - 2**63) for i in range(4)])
Each node is responsible for the token values between its endpoint and the endpoint of the previous node
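The lookup described above can be sketched as follows. This is a minimal illustration, not Cassandra's implementation: `toy_token` is a hypothetical stand-in that uses MD5 from the standard library instead of Murmur3 (so its tokens will not match real Cassandra tokens), and the `owner` helper assumes each range excludes the previous node's endpoint and includes the node's own endpoint.

```python
import hashlib
from bisect import bisect_left

NUM_NODES = 4
MIN_TOKEN = -2**63

# Endpoints for 4 evenly spaced nodes, same formula as the calculation above.
endpoints = [(2**64 // NUM_NODES) * i + MIN_TOKEN for i in range(NUM_NODES)]


def toy_token(partition_key: str) -> int:
    """Stand-in hash (NOT Murmur3): map a key to a signed 64-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def owner(token: int) -> int:
    """Index of the node whose range (previous endpoint, own endpoint] holds token."""
    # bisect_left finds the first endpoint >= token; tokens beyond the last
    # endpoint wrap around the ring to node 0.
    return bisect_left(endpoints, token) % NUM_NODES


for key in ["alice", "bob", "carol"]:
    token = toy_token(key)
    print(key, token, "-> node", owner(token))
```

Keeping the endpoints in a sorted list makes the lookup a binary search, which is why a node only needs the token list to route a row.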
Replication
A replication factor must be specified whenever a keyspace (Cassandra's equivalent of a database) is defined
Virtual nodes
Virtual nodes are an alternative way to assign token ranges to nodes, and are now the default in Cassandra
With virtual nodes, a node is responsible for many small token ranges instead of just one (256 by default before Cassandra 4.0, which lowered the default to 16)
Virtual nodes (aka vnodes) were created to make it easier to add new nodes to a cluster while keeping the cluster balanced
When a new node is added, it receives many small token range slices from the existing nodes, to maintain a balanced cluster.
With the old approach of static token ranges, it was common to double the number of nodes, so that each new node could be given a token halfway between two existing endpoints
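Both approaches can be compared with a small simulation. This is a rough sketch under stated assumptions, not how Cassandra allocates tokens: the node names, the random token choice, the fixed seed, and the helpers `midpoint_tokens` and `ownership` are all hypothetical and exist only for illustration.

```python
import random

RING_SIZE = 2**64
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1


def midpoint_tokens(endpoints):
    """Static doubling: one new token halfway between each pair of neighbours."""
    ordered = sorted(endpoints)
    new_tokens = []
    for i, start in enumerate(ordered):
        nxt = ordered[(i + 1) % len(ordered)]
        gap = (nxt - start) % RING_SIZE  # distance to the next endpoint, with wrap
        mid = start + gap // 2
        if mid > MAX_TOKEN:  # wrap back into the signed 64-bit range
            mid -= RING_SIZE
        new_tokens.append(mid)
    return new_tokens


def ownership(ring):
    """ring maps token -> node name; return node name -> count of tokens owned."""
    tokens = sorted(ring)
    owned = {}
    for i, token in enumerate(tokens):
        prev = tokens[i - 1]  # for i == 0 this wraps to the largest token
        span = (token - prev) % RING_SIZE or RING_SIZE
        owned[ring[token]] = owned.get(ring[token], 0) + span
    return owned


rng = random.Random(42)  # fixed seed so the run is repeatable

# Four nodes with 256 randomly placed vnode tokens each.
ring = {}
for name in ["node1", "node2", "node3", "node4"]:
    for _ in range(256):
        ring[rng.randint(MIN_TOKEN, MAX_TOKEN)] = name
before = ownership(ring)

# A fifth node joins with its own 256 tokens, splitting existing ranges.
for _ in range(256):
    ring.setdefault(rng.randint(MIN_TOKEN, MAX_TOKEN), "node5")
after = ownership(ring)

for name in sorted(after):
    print(name, f"{after[name] / RING_SIZE:.1%}")
```

Each new token splits exactly one existing range, so an existing node can only give up data, never gain it. With many small ranges, the new node's share is drawn from across the cluster rather than from a single neighbour.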