Exploring the Risks Behind Categorizing Contracts in EVM


In the field of smart contracts, the “Ethereum Virtual Machine (EVM)” and its algorithms and data structures are first principles.

This article starts with why contracts need to be classified, combines each scenario’s possible malicious attacks, and finally gives a set of contract classification analysis algorithms to achieve relative safety.

Although it is technically dense, it can also be read casually as a glimpse into the dark forest of games played between decentralized systems.

1. Why do contracts need to be classified?

Because classification is the cornerstone of exchanges, wallets, blockchain explorers, data analysis platforms, and other Dapps.

A transaction counts as an ERC20 transfer because its behavior conforms to the ERC20 standard. At a minimum:

  1. The transaction status is successful

  2. The To address is a contract that meets the ERC20 standard

  3. It calls the transfer function, which is recognizable because the first 4 bytes of the transaction's CallData are 0xa9059cbb

  4. After execution, a Transfer event is emitted by that To address
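
The four conditions above can be sketched as a single predicate. This is only an illustrative sketch: the `tx` and `receipt` dictionary layouts and the `is_erc20_contract` flag are assumptions loosely modeled on JSON-RPC responses, not part of the original article. The selector and the Transfer event topic are the standard ERC20 values.

```python
TRANSFER_SELECTOR = "a9059cbb"  # first 4 bytes of CallData for transfer(address,uint256)
# topic0 of the ERC20 event Transfer(address,address,uint256)
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def looks_like_erc20_transfer(tx, receipt, is_erc20_contract):
    """Apply the four conditions; the dict layouts are hypothetical."""
    to = (tx.get("to") or "").lower()
    calldata = tx.get("input", "0x")[2:].lower()
    emitted = any(
        log["address"].lower() == to
        and len(log["topics"]) > 0
        and log["topics"][0].lower() == TRANSFER_TOPIC
        for log in receipt.get("logs", [])
    )
    return bool(
        receipt.get("status") == 1                # 1. the transaction succeeded
        and is_erc20_contract                     # 2. To address was classified as ERC20
        and calldata[:8] == TRANSFER_SELECTOR     # 3. selector matches transfer
        and emitted                               # 4. Transfer event emitted by To
    )
```

Condition 2 is passed in as a flag on purpose: deciding it reliably is the subject of the rest of this article.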

Misclassification will result in misjudgment of transaction behavior

The classification of the To address determines how its CallData is interpreted, so a wrong classification leads to completely different conclusions about what the transaction did. For Dapps, communication between on-chain and off-chain systems depends heavily on listening for transaction events, and an event with a given encoding is only credible if it was emitted by a contract that meets the standard.

Misclassification will result in transactions entering a black hole

If a user transfers tokens to a contract that has no function for moving those tokens out again, the funds are locked as if burned and can no longer be controlled.

A large number of projects now add built-in wallet support, so they inevitably manage wallets for users and need to classify newly deployed contracts from the chain in real time to see whether they match asset standards.

2. What are the risks of classification?

On-chain is a place without identity or rule of law: you cannot stop a valid transaction, even if it is malicious.

It can be a wolf pretending to be a grandmother, exhibiting many behaviors that fit your expectations of a grandmother, but with the goal of entering the house to rob it. The declared standard may not match the actual behavior.

The common way is to use the EIP-165 standard directly to read whether an address declares support for a token standard (ERC-721 and ERC-1155 require EIP-165; ERC-20 itself does not). This is efficient, but the contract is controlled by the other party, so the declaration can ultimately be forged.

Querying the 165 standard is just the lowest-cost way, given the limited opcodes available on-chain, to prevent funds from being transferred into a black hole.
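
To make the 165 query concrete, here is a minimal sketch of the CallData such a query carries and of how the answer is read. The selector 0x01ffc9a7 for supportsInterface(bytes4) and the ERC-721 interface id 0x80ac58cd come from the EIPs; the helper names are made up for illustration, and the actual eth_call is left out.

```python
SUPPORTS_INTERFACE_SELECTOR = "01ffc9a7"  # supportsInterface(bytes4), from EIP-165
ERC721_INTERFACE_ID = "80ac58cd"          # interface id listed in EIP-721


def _strip0x(value):
    return value[2:] if value.startswith("0x") else value


def supports_interface_calldata(interface_id_hex):
    """4-byte selector followed by the bytes4 argument, right-padded to 32 bytes."""
    interface_id = _strip0x(interface_id_hex.lower())
    if len(interface_id) != 8:
        raise ValueError("interface id must be exactly 4 bytes")
    return "0x" + SUPPORTS_INTERFACE_SELECTOR + interface_id.ljust(64, "0")


def declares_support(return_data_hex):
    """Decode the ABI-encoded bool. This is only what the contract claims."""
    raw = _strip0x(return_data_hex)
    return len(raw) == 64 and int(raw, 16) == 1
```

Nothing in the returned boolean is verified by the chain: a malicious contract can answer true for any interface id, which is exactly the forgery risk described above.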

That's why, when we previously analyzed NFTs, we specifically mentioned the SafeTransferFrom method in the standard, where Safe means that a recipient contract must declare that it can handle the NFT (in ERC-721 this declaration is made through the onERC721Received callback, in the same spirit as a 165 query).

Only by starting from the contract bytecode and statically analyzing the code itself against the contract's expected behavior can classification become more accurate.

3. Contract Classification Scheme Design

Next, we will analyze the overall plan systematically. Note that our goal is to achieve “accuracy” and “efficiency,” two core indicators.

Even when the direction is right, the route to the other shore is not obvious. The first stop in bytecode analysis is getting the code.

3.1. How to get the code?

On the chain side there is getCode (eth_getCode in JSON-RPC), an RPC method that returns the bytecode at a specified address. For reads alone this is very fast, because the codeHash is stored directly in the EVM account structure.

However, this method fetches the code of one specific address at a time. How can we further improve accuracy and efficiency?
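
As a small sketch of what the getCode call looks like on the wire, the snippet below builds the JSON-RPC request body and interprets the result. The function names are illustrative assumptions, and sending the request to a node is deliberately left out.

```python
import json


def get_code_request(address, block="latest", request_id=1):
    """Build the JSON-RPC body for eth_getCode (sending it is up to the caller)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getCode",
        "params": [address, block],
    })


def is_contract(get_code_result):
    """An address with no deployed code returns '0x'."""
    return get_code_result not in (None, "", "0x")
```

An empty result only says that no code is deployed at that block; it says nothing about a deployment still waiting in the memory pool, which is why the questions that follow matter.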

If it is a transaction for deploying a contract, how can we get the deployed code immediately after it is executed or even when it is still in the memory pool?

If the transaction uses the contract factory pattern, does the deployed code exist in the transaction's CallData?

My final method is to use a sieve-like pattern to classify:

  1. For transactions that are not contract deployments, use getCode to obtain the code of the addresses involved and classify them.
  2. For the latest transactions in the memory pool, filter out those whose To address is empty. Their CallData is the creation bytecode, including the constructor.
  3. For transactions in the contract factory pattern, a deployed contract may itself be called in a loop to perform further deployments, so recursively analyze the sub-calls of the transaction and record each call whose type is CREATE or CREATE2.

When I was implementing the demo, I found that this places relatively high demands on the RPC node version. The most difficult part of the whole process is recursively finding the relevant calls in step 3. The low-level approach is to reconstruct the call context from the opcodes, which startled me!

Fortunately, current versions of Geth provide the debug_traceTransaction method, which reconstructs the context of each call from the opcodes and organizes the core fields.
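
The recursive search in step 3 can be sketched as a walk over a trace tree. The sketch assumes the nested shape produced by Geth's built-in callTracer (a dict with `type`, `from`, `to`, `input`, `output`, and an optional `calls` list), and assumes that for a creation frame `to` holds the new address, `input` the creation code, and `output` the deployed code. Those field meanings should be checked against the node actually used.

```python
def collect_creations(call, found=None):
    """Depth-first walk that records every CREATE / CREATE2 frame in a trace."""
    if found is None:
        found = []
    if call.get("type") in ("CREATE", "CREATE2"):
        found.append({
            "creator": call.get("from"),
            "address": call.get("to"),          # assumed: address of the new contract
            "init_code": call.get("input"),     # assumed: creation bytecode
            "runtime_code": call.get("output"), # assumed: deployed bytecode
        })
    for sub_call in call.get("calls") or []:
        collect_creations(sub_call, found)
    return found
```

Because the walk visits every nested frame, a factory that deploys a contract which in turn deploys others is covered without special cases.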

Finally, the original bytecode of multiple deployment modes (direct deployment, factory mode single deployment, factory mode batch deployment) can be obtained.

3.2. How to classify code?

The simplest, though insecure, way is to match the code directly as a string. Taking ERC20 as an example, the functions that conform to the standard are:

The value after each function name is its function signature. As mentioned in earlier analysis, a transaction finds its target function by matching the first 4 bytes of the CallData.

Therefore, the signatures of these 6 functions must appear in the contract bytecode.

This method is very quick: find all 6 and you are done. The unsafe part is that if I declare a separate variable in a Solidity contract and store the value 0x18160ddd in it, the contract will also be considered to have this function.
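
A minimal sketch of the string-matching approach and of its weakness is shown below. The six entries are the mandatory ERC20 functions with their well-known 4-byte signatures; the `decoy` bytecode is a made-up blob that only carries the values as data and implements no functions at all.

```python
ERC20_SIGNATURES = {
    "totalSupply()": "18160ddd",
    "balanceOf(address)": "70a08231",
    "transfer(address,uint256)": "a9059cbb",
    "transferFrom(address,address,uint256)": "23b872dd",
    "approve(address,uint256)": "095ea7b3",
    "allowance(address,address)": "dd62ed3e",
}


def naive_is_erc20(bytecode_hex):
    """Plain substring search over the hex string: fast, but easy to fool."""
    code = bytecode_hex.lower()
    if code.startswith("0x"):
        code = code[2:]
    return all(signature in code for signature in ERC20_SIGNATURES.values())


# Hypothetical decoy: each value is pushed as a constant and immediately dropped.
decoy = "0x" + "".join("63" + s + "50" for s in ERC20_SIGNATURES.values())
print(naive_is_erc20(decoy))  # the decoy passes although it has no functions
```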

3.3. Accuracy Improvement 1 – Decompilation

A more accurate method is to decompile the bytecode into opcodes. Decompilation here means converting the obtained bytecode into its opcode sequence (strictly speaking, disassembly). More advanced decompilation converts it into pseudocode, which is easier for humans to read, but we don't need that this time. The decompilation methods are listed in the appendix at the end of the article.

Solidity (high-level language)->bytecode->opcode

One feature stands out clearly: function signatures are pushed onto the stack by the PUSH4 opcode. So the next step is to extract the operand after every PUSH4 in the code and match it against the function standard.

I also ran a simple performance test, and Go proved very efficient: 10,000 decompilations take only 220ms.
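
Extracting the operands of PUSH4 only needs a linear pass that knows how many operand bytes each PUSH instruction carries. The following is a minimal sketch in Python rather than the Go libraries used for the measurements in this article; the function names are illustrative.

```python
def disassemble(bytecode_hex):
    """Yield (offset, opcode, operand_bytes), stepping over PUSH operands."""
    raw = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(raw)
    pc = 0
    while pc < len(code):
        op = code[pc]
        # PUSH1..PUSH32 are 0x60..0x7f and carry 1..32 operand bytes
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        yield pc, op, code[pc + 1: pc + 1 + size]
        pc += 1 + size


def push4_operands(bytecode_hex):
    """Return the 4-byte operands of every PUSH4 (0x63) instruction, in order."""
    return [
        operand.hex()
        for _, op, operand in disassemble(bytecode_hex)
        if op == 0x63 and len(operand) == 4
    ]
```

Unlike a substring search, this pass does not report a signature that merely sits inside the operand of a longer PUSH, because those bytes are skipped as data.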

The following content will be somewhat difficult.

3.4. Accuracy Improvement 2 – Finding Code Blocks

Although the accuracy has improved, it is still not enough, because we search for PUSH4 across the entire code. Someone can still declare a variable of type bytes4, which also produces a PUSH4 instruction.

While I was struggling with this, I thought of some open source projects’ implementations. ETL is a tool that reads on-chain data for analysis, and it separates ERC 20 and 721 transfers into separate tables, so it must have the ability to classify contracts.

Analysis shows that it classifies by code block and only processes the PUSH4 instructions in the first block, basic_blocks[0].

The problem now is how to accurately determine the code block

The concept of a code block comes from two consecutive opcodes, REVERT + JUMPDEST. The two must appear back to back because, within the opcode range of the function selector, a contract with many functions gets paging logic, so a lone JUMPDEST instruction can also appear there.
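
The block rule can be sketched as follows: scan instructions in order, collect PUSH4 operands, and stop at the first REVERT immediately followed by JUMPDEST. This is a sketch of the heuristic as described here, not the ETL project's code, and the sample bytecode is hand-assembled for illustration; real compiler output can differ.

```python
REVERT, JUMPDEST, PUSH4 = 0xFD, 0x5B, 0x63


def instructions(bytecode_hex):
    """Return a list of (opcode, operand_bytes), stepping over PUSH operands."""
    raw = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code, pc, out = bytes.fromhex(raw), 0, []
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        out.append((op, code[pc + 1: pc + 1 + size]))
        pc += 1 + size
    return out


def first_block_push4(bytecode_hex):
    """PUSH4 operands that occur before the first REVERT + JUMPDEST pair."""
    found, previous = [], None
    for op, operand in instructions(bytecode_hex):
        if previous == REVERT and op == JUMPDEST:
            break  # delimiter reached: the first code block ends here
        if op == PUSH4 and len(operand) == 4:
            found.append(operand.hex())
        previous = op
    return found


# selector section, lone JUMPDEST, REVERT + JUMPDEST, then a bytes4 constant
sample = ("0x60003560e01c80632e64cec11461003b5780636057361d1461005957"
          "5b600080fd5b" "63a9059cbb50")
print(first_block_push4(sample))
```

In the sample, the lone JUMPDEST after the last JUMPI does not end the block, while the bytes4 constant placed after the REVERT + JUMPDEST pair is ignored.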

3.5. Accuracy Improvement 3 – Finding Function Selectors

The function selector reads the first 4 bytes of the transaction's CallData and matches them against the function signatures preset in the contract code, so that execution can jump to the bytecode position of the matching function.

Let's try a minimal simulated execution.

This part has two functions, store(uint256) and retrieve(), whose signatures can be calculated as 6057361d and 2e64cec1 respectively.
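
For readers who want to reproduce the calculation: a function signature is the first 4 bytes of the Keccak-256 hash of the canonical function declaration. Python's hashlib only ships the NIST SHA3 variant, which uses different padding, so the sketch below includes a compact pure-Python Keccak-256 (slow, for illustration only). The helper names are made up.

```python
MASK64 = (1 << 64) - 1


def _rol64(value, n):
    n %= 64
    return ((value << n) | (value >> (64 - n))) & MASK64


def _keccak_f1600(lanes):
    r = 1
    for _ in range(24):
        # theta
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota (round constants generated by an LFSR)
        for j in range(7):
            r = ((r << 1) ^ ((r >> 7) * 0x71)) % 256
            if r & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def keccak256(data):
    rate = 136  # bytes absorbed per permutation for a 256-bit digest
    padded = bytearray(data)
    pad_len = rate - (len(padded) % rate)
    padded += b"\x01" + b"\x00" * (pad_len - 1)  # Keccak padding, not SHA3's 0x06
    padded[-1] |= 0x80
    state = bytearray(200)
    for offset in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[offset + i]
        lanes = [[int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
                  for y in range(5)] for x in range(5)]
        lanes = _keccak_f1600(lanes)
        for x in range(5):
            for y in range(5):
                state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return bytes(state[:32])


def function_signature(declaration):
    """First 4 bytes of keccak256 over e.g. 'store(uint256)', as hex."""
    return keccak256(declaration.encode("ascii"))[:4].hex()
```

The declaration must be in canonical form, with no spaces or parameter names and with full type names such as uint256 rather than uint.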

After decompilation, the following opcode string is obtained, which can be said to be divided into two parts.

First Part:

In compiled code, only the function selector part of the contract reads the CallData in order to obtain the function call signature, as shown in the figure below.

We can see the effect by simulating how the EVM's stack changes.

Second Part:

The process of checking whether it matches the selector value.

1. PUSH4 puts the 4-byte function signature of retrieve() (0x2e64cec1) on the stack.

2. The EQ opcode pops two values from the stack, here 0x2e64cec1 and the signature taken from the CallData (0x6057361d in this example), and checks whether they are equal.

3. PUSH2 puts 2 bytes of data (0x003b, decimal 59) on the stack. The EVM keeps a program counter, which specifies the position in the bytecode of the next instruction to execute. Here the value is 59 because that is where the retrieve() bytecode starts.

4. JUMPI means “if …, then jump to …”. It pops two values from the stack as inputs and if the condition is true, the program counter will be updated to 59.

This is how the EVM determines the position of the bytecode to be executed based on the function call in the contract.

In fact, this is just a set of simple “if statements” for each function in the contract and their jump positions.
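
The chain of "if statements" can be replayed with a tiny stack machine. This sketch implements only the handful of opcodes that appear in the selector section and is in no way a full EVM. The `DISPATCHER` bytes are a hand-assembled fragment using the two signatures and the jump target 0x003b from this example; the second target, 0x0059, is an assumption for illustration.

```python
# PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR;
# DUP1; PUSH4 2e64cec1; EQ; PUSH2 003b; JUMPI;
# DUP1; PUSH4 6057361d; EQ; PUSH2 0059; JUMPI
DISPATCHER = "60003560e01c80632e64cec11461003b5780636057361d1461005957"


def dispatch(code_hex, calldata_hex):
    """Run the selector section; return the jump target taken, or None."""
    code, calldata = bytes.fromhex(code_hex), bytes.fromhex(calldata_hex)
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:                      # PUSH1..PUSH32
            size = op - 0x5F
            stack.append(int.from_bytes(code[pc + 1: pc + 1 + size], "big"))
            pc += 1 + size
            continue
        if op == 0x35:                              # CALLDATALOAD: 32 bytes, zero padded
            offset = stack.pop()
            word = calldata[offset: offset + 32].ljust(32, b"\x00")
            stack.append(int.from_bytes(word, "big"))
        elif op == 0x1C:                            # SHR: shift is on top, value below
            shift, value = stack.pop(), stack.pop()
            stack.append(value >> shift)
        elif op == 0x80:                            # DUP1
            stack.append(stack[-1])
        elif op == 0x14:                            # EQ
            stack.append(1 if stack.pop() == stack.pop() else 0)
        elif op == 0x57:                            # JUMPI: destination on top, condition below
            destination, condition = stack.pop(), stack.pop()
            if condition:
                return destination
        else:
            raise ValueError("opcode 0x%02x is not modeled in this sketch" % op)
        pc += 1
    return None


print(dispatch(DISPATCHER, "2e64cec1"))  # CallData of a retrieve() call
```

Calling it with the CallData of retrieve() returns 59, the position described in step 3, and an unknown signature falls through every comparison and returns None.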

4. Summary of plan

The overview is as follows:

  1. The deployed bytecode of each contract address can be obtained through the getCode RPC method or debug_traceTransaction, and the opcodes can be obtained by decompiling it with the Go VM and ASM libraries.

  2. Contracts in EVM operation have the following characteristics:

  • REVERT+JUMPDEST serves as the code block delimiter

  • The contract must have a function selector, and it is necessarily in the first code block

  • The function selector uses PUSH4 as the opcode that carries each function signature

  • The selector contains the consecutive opcodes PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR; DUP1, whose core job is to load the CallData and shift it. No other syntax in a contract function produces this sequence

  3. The corresponding function signatures are defined in the EIPs, which state explicitly which ones are mandatory and which are optional.
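
Putting the characteristics together, a classifier can be sketched like this: locate the CallData loading sequence, collect PUSH4 operands from there until the REVERT + JUMPDEST delimiter, and compare them with the mandatory ERC20 signatures. This is a sketch of the plan under those stated assumptions, not the author's Go implementation, and all names are illustrative.

```python
# PUSH1 00; CALLDATALOAD; PUSH1 e0; SHR; DUP1 as (opcode, operand hex) pairs
SELECTOR_PREAMBLE = [(0x60, "00"), (0x35, ""), (0x60, "e0"), (0x1C, ""), (0x80, "")]
ERC20_REQUIRED = {"18160ddd", "70a08231", "a9059cbb", "23b872dd", "095ea7b3", "dd62ed3e"}


def decode(bytecode_hex):
    """Return a list of (opcode, operand_hex), stepping over PUSH operands."""
    raw = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code, pc, out = bytes.fromhex(raw), 0, []
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        out.append((op, code[pc + 1: pc + 1 + size].hex()))
        pc += 1 + size
    return out


def selector_signatures(bytecode_hex):
    """PUSH4 operands between the preamble and the next REVERT + JUMPDEST."""
    ins, n = decode(bytecode_hex), len(SELECTOR_PREAMBLE)
    for i in range(len(ins) - n + 1):
        if ins[i: i + n] == SELECTOR_PREAMBLE:
            found, previous = set(), None
            for op, operand in ins[i + n:]:
                if previous == 0xFD and op == 0x5B:
                    break
                if op == 0x63:
                    found.add(operand)
                previous = op
            return found
    return set()  # no selector found: treat as unclassifiable


def is_erc20(bytecode_hex):
    return ERC20_REQUIRED <= selector_signatures(bytecode_hex)
```

A bytecode blob that only stores the six values as constants no longer passes, because without the loading sequence no signatures are collected at all.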

4.1 Uniqueness Proof

So at this point we can say that we have basically implemented an efficient and accurate contract analysis method. Since we have been rigorous this far, we might as well go one step further. In the plan above we used REVERT+JUMPDEST to separate code blocks, combined with the mandatory CallData loading and shifting, to establish uniqueness. Can a similar opcode sequence be produced from within a Solidity contract?

I ran a comparative experiment. Although Solidity syntax offers ways to read the CallData, such as msg.sig, they compile to a different opcode sequence.