Hyperledger v1 Ledger High-level Design
Objectives
• Support the v1 endorsement/consensus model - separation of simulation (chaincode execution) and block commit
  – Endorsement/simulation (chaincode execution) can be performed on a subset of peers
  – Parallel execution of chaincode (concurrency)
  – Improved scalability
• Embed transaction read/write sets on the blockchain (input version and post-image)
  – Immutability, auditing, provenance
• Optimize data storage for the blockchain usage pattern
  – New file-based ledger for improved performance
  – Continue using RocksDB for ‘indexes’ to optimize ledger queries
• Support pluggable data stores and a rich query language
  – Challenging, given the first objective - most databases do not support the simulation and read/write-set requirements, so limitations are likely. Next priority for investigation.
Ledger - Current work focus
KV-ledger (High-level components)
• Block storage
  – Stores and retrieves blocks
  – Assumes blocks arrive in exact sequence
  – Queries supported
    • Retrieve blocks by block hash and block number
    • Scan a range of blocks between two block numbers
    • Retrieve a transaction by txId
• Transaction execution
  – Simulates transactions and produces a ReadWriteSet (Endorser)
    • Queries/updates - GetKey/SetKey/GetKeyRange
  – Validates and applies the ReadWriteSet (Committer)
    • Key-version-based validation (MVCC)
  – Read-only queries
    • GetKey/GetKeyRange
Filesystem-based Block Storage
• Blocks are stored in file segments
  – Default segment size 64 MB
• Each file segment contains
  – File segment header (version etc.)
  – A sequence of: varint-encoded length of block bytes, followed by the block bytes
• RocksDB contains block indexes to support common queries
  – Index block by hash, index block by number, index transaction by id
  – The value of each index entry is a file-offset pointer
  – Potentially encode the starting block number in the segment file name, include a segment-specific block index at the end of each segment file, and use blockNumber_tranId as the transaction id, so that one can jump directly to the right segment file given a block number or transaction id, without needing an external blockNum or txId index (an external blockHash index would still be needed)
• Usage
  – Raw ledger - stores ‘batches’ of raw transactions to be committed
  – Final validated ledger - stores committed ‘blocks’ of valid transactions
[Diagram: RocksDB block index maps blockHash, blockNum, and txId each to a SegNo + offset pointing into a segment file; segment file layout: file seg header | Block-1 length | Block-1 | Block-2 length | Block-2]
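The segment file layout described above can be sketched in a few lines. This is an illustrative model only, not Fabric code: `SegmentWriter`, `append_block`, and the in-memory index dict are hypothetical names, and a real implementation would write to actual files and persist the index in RocksDB.

```python
import io

def encode_varint(n: int) -> bytes:
    """Protobuf-style base-128 varint encoding of a non-negative int."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(stream) -> int:
    """Read a base-128 varint from a binary stream."""
    shift, result = 0, 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result
        shift += 7

class SegmentWriter:
    """Appends blocks to a segment as [varint length][block bytes],
    recording a (segment number, file offset) per block number."""
    def __init__(self, stream, seg_no: int):
        self.stream, self.seg_no = stream, seg_no
        self.index = {}          # block number -> (seg_no, offset)
        self.next_block_num = 0

    def append_block(self, block_bytes: bytes) -> None:
        offset = self.stream.tell()
        self.stream.write(encode_varint(len(block_bytes)))
        self.stream.write(block_bytes)
        self.index[self.next_block_num] = (self.seg_no, offset)
        self.next_block_num += 1

def read_block_at(stream, offset: int) -> bytes:
    """Random-access read of one block given its index offset."""
    stream.seek(offset)
    length = decode_varint(stream)
    return stream.read(length)
```

Because blocks are length-prefixed and append-only, commit is a sequential write, and any point query resolves to one seek plus one read via the offset stored in the index.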
Filesystem-based Block Storage
• Pros
  – Blocks arrive in sequential order, resulting in an efficient append-only workload
  – Avoids the write amplification associated with RocksDB and other storage solutions
  – Makes it more feasible to move large numbers of blocks in bulk, for example when a new peer comes online (move entire files instead of reading/writing N blocks)
• Cons
  – Custom block data management on the file system
  – Need to maintain the integrity of file segments and consistency between block files and RocksDB indexes
    • Need utilities to validate that block files and RocksDB are in sync, and to rebuild indexes as needed
Logical structure of a RWSet

Block {
  Transactions [
    {
      "Id" : txUUID1,
      "Invoke" : "Method(arg1, arg2, .., argN)",
      "TxRWSet" : [
        {
          "Chaincode" : "ccId",
          "Reads" : [ { "key" : "key1", "version" : v1 } ],  // if a Tx performs both a read and a write on a key, the key appears only in Writes
          "Writes" : [ { "key" : "key2", "version" : v2, "value" : bytes1 } ]  // a missing value indicates a delete operation
        } // end chaincode RWSet
      ] // end TxRWSet
    }, // end transaction with "Id" txUUID1
    { // another transaction
    }
  ] // end Transactions
} // end Block
JSON syntax is used only as a conceptual representation; data is serialized in a binary representation, with chaincode ids in sorted order and sorted keys within each chaincode.
Notes:
• Need to add the chaincode version. It will be used for auditing, and perhaps for commit validation as well - especially upon chaincode upgrade. Need to go through all upgrade scenarios, e.g., ensure simulation was done on the latest chaincode version available.
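The deterministic ordering requirement (sorted chaincode ids, sorted keys per chaincode) can be sketched as follows. This is a conceptual model only: the input dict shape and function name are hypothetical, and compact JSON stands in for the real binary serialization, which the actual design leaves unspecified here.

```python
import json

def canonical_rwset(tx_rwset: dict) -> bytes:
    """tx_rwset: {ccId: {"reads": {key: version},
                          "writes": {key: (version, value_or_None)}}}
    Returns a deterministic byte encoding: ccIds sorted, keys sorted
    within each chaincode, so identical logical RWSets serialize to
    identical bytes regardless of construction order."""
    out = []
    for cc_id in sorted(tx_rwset):
        cc = tx_rwset[cc_id]
        out.append({
            "chaincode": cc_id,
            "reads":  [{"key": k, "version": cc["reads"][k]}
                       for k in sorted(cc.get("reads", {}))],
            # value of None models a delete (missing value in the design)
            "writes": [{"key": k, "version": cc["writes"][k][0],
                        "value": cc["writes"][k][1]}
                       for k in sorted(cc.get("writes", {}))],
        })
    # sort_keys + compact separators make the byte stream deterministic
    return json.dumps(out, sort_keys=True, separators=(",", ":")).encode()
```

Determinism matters because endorsers hash and sign over these bytes; two peers simulating the same transaction must produce byte-identical RWSets.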
Transaction execution - Version maintenance
• Version maintenance
  – It should be possible to detect whether a key has changed between the simulation and commit phases of a transaction (MVCC validation)
  – Versioning scheme for a unique version per key - two options:
    • Incrementing numbers (initial implementation)
    • txId of the last committed transaction that updated the key (implement behind a config option and compare)
• Pros/cons of using txId as the version identifier
  – Pros
    • Does not require introducing a new concept (e.g., an auto-incrementing number for each key separately)
    • Consistent with the popular Bitcoin transaction structure - (key + version) is equivalent to an ‘input’; (key + newValue) is equivalent to a UTXO output
    • Provides built-in provenance - a pointer to the prior transaction for this key, which can easily be traversed backwards to track the full history of a key over time
    • A separate fork id is not required in PoW for uniqueness
  – Cons
    • Transaction ids are significantly longer than incrementing numbers (txIds may be 32 bytes if a crypto hash of the contents is used, as in the case of PBFT)
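The "built-in provenance" pro can be made concrete with a small sketch. Assume (hypothetically) that the version recorded in each transaction's read/write set is the txId of the prior writer of that key; walking those pointers backwards then yields the key's full history without any extra history index. The data structures here are illustrative, not Fabric types.

```python
def key_history(key: str, state: dict, txs: dict) -> list:
    """state: key -> txId of the last committed transaction that wrote it.
    txs: txId -> {"writes": {key: prior_txId_or_None}}, where the version
    stored with each write is the txId of the previous writer of that key.
    Returns the chain of writer txIds, newest first."""
    history, tx_id = [], state.get(key)
    while tx_id is not None:
        history.append(tx_id)
        # the recorded version is a back-pointer to the prior writer
        tx_id = txs[tx_id]["writes"].get(key)
    return history
```

With incrementing-number versions the same query would require a separate per-key history index; with txId versions the ledger itself is the history structure.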
Transaction execution - Simulation (Chaincode execution)
• Transaction simulation
  – A scheme for simulating a transaction on a consistent copy of the data
• RocksDB contains a latest-state index for fast simulation queries
  – Index by composite key (ccId:keyId)
    • If chaincodes are limited in number, use a separate column family per chaincode (configurable?)
    • Collocating the keys of a chaincode allows faster transaction simulation, particularly for range-scan queries
  – Latest value encoded as [version:deleteMarker:latestValueBytes (if present)]
    • The value bytes can be a file-offset pointer into block storage for very large values (configurable - default: over 1 MB?)
• Tx simulation performed on a stable snapshot, supporting concurrency - initial two options:
  – Locking-based concurrency control (initial implementation)
    • Read locks on RocksDB state by the simulator(s) and a write lock during commit
  – Snapshot-based concurrency control (implement behind a config option and compare under load)
    • Create a RocksDB snapshot and simulate on the snapshot
    • Does not prevent concurrent commit of new blocks
[Diagram: RocksDB state index maps ccId+keyId → version+deleteMarker+latestValueBytes]
Notes:
• This is a simulation runtime optimization. Alternatively, the state key index could point to the write set in the ledger's block/transaction storage, and values could be read from there as the single source of truth, but this would not be as efficient.
• Bitcoin uses a similar index in LevelDB for unspent transactions.
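The composite-key and value layouts above can be sketched as follows. This is an assumption-laden illustration: a zero-byte separator and a fixed 8-byte version are illustrative choices (a real implementation would more likely use length-prefixed fields, and the version may be a txId rather than an integer).

```python
SEP = b"\x00"   # assumes ccId never contains the separator byte

def state_key(cc_id: str, key: str) -> bytes:
    """Composite key (ccId:keyId). Collocates all keys of a chaincode,
    so a range scan over one chaincode is a simple prefix scan."""
    return cc_id.encode() + SEP + key.encode()

def encode_state_value(version: int, deleted: bool, value: bytes = b"") -> bytes:
    """Value layout: [version : deleteMarker : latestValueBytes]."""
    return version.to_bytes(8, "big") + (b"\x01" if deleted else b"\x00") + value

def decode_state_value(raw: bytes):
    version = int.from_bytes(raw[:8], "big")
    deleted = raw[8] == 1
    return version, deleted, raw[9:]
```

Keeping the version and delete marker in a fixed-size prefix lets MVCC validation read just the version without deserializing a potentially large value.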
Transaction execution – Validation/Commit
• Committing peer choreography
  – Receive ‘batches’ of transactions from consensus (ordering service)
  – Call the Validation System Chaincode (VSCC) to ensure the endorsement policy has been fulfilled
  – Call the ledger to perform the multiversion concurrency control (MVCC) check; remove invalid transactions; build a ‘block’ of the remaining valid transactions
    • Initial implementation with sequential validation
    • Extend to parallel validation of transactions in a block
      – Using a lock manager that maintains one lock per key (acquire locks sequentially; once all locks are acquired, start performing validation)
      – Split transactions into conflict-free batches by dependency analysis and perform validation in parallel
  – Call the Committer System Chaincode (CSCC) via gossip to ensure the final blocks are the same across peers
  – Call the ledger to commit the validated block to file-based storage and update the RocksDB indexes
Notes: Also need to validate that a transaction id has not already been used.
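The sequential MVCC check in the choreography above can be sketched as follows: a transaction is valid iff every key it read still carries the version observed during simulation, and valid transactions then apply their writes in order. The data shapes are illustrative, not the actual Fabric types.

```python
def validate_and_commit_batch(batch, state):
    """batch: list of (txId, reads, writes), where reads: {key: version
    observed at simulation time} and writes: {key: new value}.
    state: key -> (version, value); here the txId of the last writer is
    used as the version (one of the two versioning options above).
    Mutates state and returns the txIds of the valid transactions."""
    valid = []
    for tx_id, reads, writes in batch:
        # MVCC check: every read version must still match committed state
        if all(state.get(k, (None, None))[0] == v for k, v in reads.items()):
            for k, value in writes.items():
                state[k] = (tx_id, value)   # txId becomes the new version
            valid.append(tx_id)
        # invalid transactions are dropped from the final block
    return valid
```

Sequential order matters: an earlier transaction's writes can invalidate a later transaction in the same batch, which is exactly the conflict that the lock-manager and dependency-analysis variants must preserve when validating in parallel.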