The language design: that on the EVM vs. that on others

Besides, even as just a compiler, it has a fatal bug.
The setup_seed function in codegen.ml lacks an SLOAD just after PUSHing the location of an array (a seed), and thus effectively says:
“if some array is not initialized, then overwrite the Program Counter to 1.”
“if some array is initialized, then abort this work.”
This is a fatal bug.

And I warn you NOT TO USE bamboo at all. (Maybe you know this well already.)
It is fully useless because it was designed without any ability to express formality.
If tests had been done correctly, this would have been avoided; and all of the source code looks like this.

I concluded that bamboo is an insane compiler, as its unreadability indicates.

The Pen compiler is planned to have an “Internal Type Checker”.
This is a new kind of type checking.

An External Type Check is static type checking done, before code generation, on a program run by a single computer. Thus the generated machine code contains no type-checking process at all.
An Internal Type Check is type checking that runs dynamically but is statically embedded in the compiled code, which means every computer in the world that runs the program must also run the type checker. This is a decentralized Type Checker.

( @pirapira 's fatal mistake was the confusion between External Type Check and Internal Type Check. )

With Linear Type Checking, or Quantitative Type Theory (a practical dependent linear type theory introduced in 2016-2018),
we can determine how many times each function has to be used.
The Pen compiler is planned to integrate a Linear Type System with the Internal Type Checking first.
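As a toy illustration of the usage counting that underlies a linear check (a sketch only; this is not pen's actual checker, and the AST names are made up for illustration):

```haskell
import qualified Data.Map as M

-- A tiny lambda calculus with named variables (hypothetical AST).
data Term = Var String | Lam String Term | App Term Term
  deriving Show

-- Count free-variable usages in a term.
usages :: Term -> M.Map String Int
usages (Var x)   = M.singleton x 1
usages (Lam x t) = M.delete x (usages t)
usages (App t s) = M.unionWith (+) (usages t) (usages s)

-- A lambda is "linear" in its argument if the bound variable
-- is used exactly once in the body.
linear :: Term -> Bool
linear (Lam x t) = M.findWithDefault 0 x (usages t) == 1
linear _         = False

main :: IO ()
main = do
  print (linear (Lam "x" (Var "x")))                  -- True: x used once
  print (linear (Lam "x" (App (Var "x") (Var "x"))))  -- False: x used twice
```

Quantitative Type Theory generalizes this exact-once counting to arbitrary semiring annotations, but the bookkeeping has the same shape.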

Now the pen compiler has a primitive lambda calculus and let-in syntax.
This is the starting point of a functional-style language and a modern static type checker.
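For a concrete picture, the standard desugaring of let-in into lambda application, turning let x = e1 in e2 into (\x -> e2) e1, can be sketched as follows (the AST constructors here are hypothetical, not pen's real ones):

```haskell
-- A sketch of the usual let-in desugaring into a beta redex.
data Term = Var String | Lam String Term | App Term Term
          | Lit Int | Let String Term Term
  deriving (Eq, Show)

desugar :: Term -> Term
desugar (Let x e1 e2) = App (Lam x (desugar e2)) (desugar e1)
desugar (Lam x t)     = Lam x (desugar t)
desugar (App t s)     = App (desugar t) (desugar s)
desugar t             = t

main :: IO ()
main = print (desugar (Let "x" (Lit 1) (Var "x")))
-- App (Lam "x" (Var "x")) (Lit 1)
```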
Check, review, and criticize!


Recursive functions are now supported in the Pen Compiler!
It is a great step from an imperative programming language toward a functional programming language!

( I have not run the code on any EVM yet. I just made the compiler itself. )

( Linear type checking is not yet supported.
We are now entering the phase of developing the formality as written in https://bentnib.org/quantitative-type-theory.pdf and TaPL (Pierce 2002). )

An EVM decompiler (yes, a decompiler!) is now available in the pen repository!
It is at the alpha version!
You can build it after installing Haskell:

git clone https://www.github.com/ghasshee/pen
cd pen/disasm 
make
cat ../count.hex | ./Disasm  

This is going to be used in developing the pen compiler, for the purpose of analyzing Solidity’s bytecode directly.
The architecture of pen is not solidified yet;
in a ‘Contract-Oriented’ language, a contract can call itself recursively,
which was @pirapira’s attempt, while Solidity seems to have chosen the other way: contracts cannot call themselves recursively.


Awesome! Salute to your ambition.

Recently, I have been reading the book TaPL, thinking about whether I can contribute to a smart contract language!

After searching, I found your amazing work. Thank you for your efforts. I will study the design of the pen compiler and try to contribute!

So, if you can, if you have the time, could you please write an introductory document for new developers of the pen compiler? I think documentation would greatly help the development of the pen compiler.


Today I fixed a bug that I had not been able to find for around a year.
So now we can deploy a simple counter contract (count.pen) that works on the EVM.

Please follow the README.md at

to deploy the counter contract on EVM.

The decompiler also works much better now than it did.
We can decompile EVM bytecode into a Guarded-Command-Like Language (GCLL) with DEVM.

Please see this great book, Formal Methods | SpringerLink,
if you want to know more about Dijkstra’s Guarded Command Language - Wikipedia.

Currently, we have two clear goals.

The first one is to develop DEVM into a tool which handles Program Graphs and does formal analysis via “Linear Algebra over a Semiring Lattice”: refer e.g.

The second one is to clean up the pen compiler, which is still really untidy.
It derives from a bad, unreadable, and bug-friendly former project.
I would rather like to reduce dependencies in order to develop the code formally.
So, removing libraries such as menhir might be a good short-term development path.
We need to rewrite everything in much more sophisticated code, in modern functional programming style such as MPRI 2-4.

The road map is still long, I feel!

Another goal is to implement the Pen compiler from scratch with a Continuation-Passing Style (CPS) transformation in Haskell.
CPS transforms lambda calculus into procedural languages like Dijkstra’s Guarded Command Language in a very good manner.

I would like to derive the “Contract-Oriented Language” which @pirapira started.
I might start another compiler which has “Recursive Contract Construction”.

Another problem is that @pirapira mixed up term trees and their types.
It might lead to a bad architecture when we would like to appreciate “functional pearls”.
Separating terms and types also enables us to insert a new intermediate language between the Pen language and EVM opcodes quite easily.

To organize the architecture of the language, I started to think we must apply CPS to it.
The answer type of the continuation is the very contract type.
In @pirapira 's setting, methods return a then becomes contract A(_) ;
in our setting, this may be (Term -> A) -> A , which is CPS.
Continuation is a monad, Cont A.
In order to accomplish formality in typing, we must employ a monad to wrap the side effects.
Which monad is it? The answer is the continuation monad. We would like to build an Abstract Syntax Tree in a functional-programming-friendly style.
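A minimal sketch of the idea that the answer type of the continuation is the contract type, using the standard Cont from the transformers package (the Contract type and the step function are hypothetical, for illustration only):

```haskell
import Control.Monad.Trans.Cont (Cont, cont, runCont)

-- Hypothetical "contract" answer type: a final storage value.
newtype Contract = Contract Int deriving (Eq, Show)

-- A computation whose continuation must ultimately produce a Contract;
-- its type is (Int -> Contract) -> Contract, i.e. Cont Contract Int.
step :: Int -> Cont Contract Int
step n = cont (\k -> k (n + 1))

-- The final continuation is the one that builds the contract itself.
run :: Cont Contract Int -> Contract
run m = runCont m Contract

main :: IO ()
main = print (run (step 0 >>= step))   -- Contract 2
```

Effects sequenced with (>>=) all thread through the same answer type, which is exactly the "answer type is the contract type" picture.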
e.g. if we write this in the pen language:

if cond 
  then return a becomes contract A(x)
  else return b becomes contract A(y)

This is typable.
We would like to make an AST like:

TmIF
  + cond 
  + TmRET(a) ; Storage[_] := x 
  + TmRET(b) ; Storage[_] := y 

And the whole body is typable using the Monad type class.
Obviously, this code has effects, and we must wrap the effects into the Cont Monad using the CPS transformation.

-- toCPS t k: CPS-transform the term t with continuation k
toCPS (TmABS x ty t) k = toCPS t $ \m -> k (TmABS x ty m)
toCPS (TmAPP t s)    k = toCPS t $ \m -> toCPS s $ \n -> k (TmAPP m n)
toCPS (TmIF b t s)   k = toCPS t $ \m -> toCPS s $ \n -> k (TmIF b m n)
toCPS (TmRET a eff)  k = toCPS a (k . eff)

with this definition of CPS;

toCPS (TmIF(cond,TmRET(a);S[_]:=x, TmRET(b); S[_] := y)) 

is converted into something like this;

return a ; S[_] := x ; return b ; S[_] := y ; if cond jump 

which is the reverse order of the GCLL IR.

Haskell is a very nice language in which it is quite simple to write code.
Using a Tree Comonad, everything is visible in a few lines ( please see the source code of DEVM ).

The language we want to create is a “functional” term language ( term ); i.e. there is no distinction between statements ( stmt ) and expressions ( expr ), whereas a low-level formal language such as Dijkstra’s Guarded Command Language consists of stmt and expr. Such a language is called a Guarded-Command-Like Language (GCLL) in our development.

To begin with the language design, we should organize the namespace well.
The highest-level language is represented by monadic terms: term wrapped by a monad m.
The second-highest-level language is represented by CPS terms: kont .
The mid-level language is represented by GCLL’s stmt and expr .
The lowest-level language is represented by assembler asm or EVM opcodes [opcode] .

The compilation pipeline looks like this:

src --parse-> m term --(typecheck)-> m term --CPS->  kont --(CPS App)-> GCLL(stmt) ---> [opcode] 
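The pipeline above can be written down as types and stub functions (a sketch under assumptions: these placeholder types and names are invented for illustration and are not pen's real modules):

```haskell
-- Hypothetical placeholder types for each language level.
newtype Term = Term Int               deriving (Eq, Show)  -- highest level
newtype Kont = Kont Term                                   -- CPS term
data    Stmt = Store Int              deriving (Eq, Show)  -- GCLL statement
data    Opcode = PUSH Int | SSTORE | STOP deriving (Eq, Show)

parse :: String -> Term               -- src --parse-> term (stub)
parse s = Term (length s)

cps :: Term -> Kont                   -- term --CPS-> kont (stub)
cps = Kont

apply :: Kont -> Stmt                 -- kont --(CPS App)-> GCLL stmt (stub)
apply (Kont (Term n)) = Store n

codegen :: Stmt -> [Opcode]           -- stmt ---> [opcode] (stub)
codegen (Store n) = [PUSH n, SSTORE, STOP]

compile :: String -> [Opcode]
compile = codegen . apply . cps . parse

main :: IO ()
main = print (compile "ab")           -- [PUSH 2,SSTORE,STOP]
```

The point is only the shape: each phase is a total function between the four levels, so the whole compiler is one composition.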

One purpose of the type-check phase is to embed a dynamic type-checking system as a monad, as mentioned earlier in this thread. The other purpose is the static checking described in TaPL (Types and Programming Languages). We postpone the type-checking construction phase until we have constructed the rest.

In our setting, a monadic term m t is intuitively defined as “adding GCLL effects to the term t ”.
So the monad m can either add dynamic type-checking code or add the continuation of the contract, which shows that the combination of monads and CPS is a much better design, enabling what we want to do.

Flush

In the DEVM decompiler, we used a Red-Black Tree to represent the AST; a Red tree represents a subtree that returns an expression, and a Black tree a subtree that does not return an expression but produces effects.

In the PEN compiler, we would like to do things in a similar way:
a Black node in the AST is an effect; e.g. a storage assignment, sending a transaction, etc.

A storage assignment does not occur until execution reaches the end of the method.
This means a storage assignment in the middle of a method is replaced with a variable assignment in EVM memory, and when everything that should be done is done, the variable is flushed to the corresponding storage address.

e.g.

counter : u256  

method foo() {
    counter := counter + 1 ;
    let a = 100 ;
    counter := a; 
    return 0 
} 

is actually

counter : u256  

method foo() {
    counter := (\c -> 
       let c' := c + 1;
       let a := 100 ;
       let c'' := a ;
       c'' ) counter  ;
    return 0 
} 

Here, counter is the only storage variable,
and its assignment is done only once, just before the method returns 0.
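The flush idea can be sketched as follows: intermediate writes go into a local cache (standing in for EVM memory), and only the final value of each storage slot is flushed with a single storage write at the end. The Write type and slot numbering are assumptions made up for this sketch:

```haskell
import qualified Data.Map as M

type Slot = Int

-- A method body as a list of storage assignments (hypothetical IR).
data Write = Write Slot Int deriving Show

-- Run the body against a memory cache, then flush once at the end:
-- each dirty slot receives exactly one storage write.
flush :: M.Map Slot Int -> [Write] -> M.Map Slot Int
flush storage body =
  let cache = foldl (\m (Write s v) -> M.insert s v m) M.empty body
  in M.union cache storage   -- left-biased: cached values win

main :: IO ()
main = print (flush (M.fromList [(0, 0)]) [Write 0 1, Write 0 100])
-- fromList [(0,100)]
```

This matches the counter example above: the writes counter := counter + 1 and counter := 100 collapse into one final storage write of 100.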

The design is described here, and it is sometimes updated;

We are referring to the vc function in Chapter 7 of Winskel’s book (The Formal Semantics of Programming Languages) as the design of the formal verification process, which is going to be translated into the SMT language http://smtlib.cs.uiowa.edu; i.e. we are going to depend on the Z3 solver for the verification process.

This is the current development status.
We are going to completely abandon the OCaml source code which derives from pirapira’s bamboo.

Currently, the Haskell-written pen compiler’s code looks like this;

contract counter {
    
    counter : u256  ; 
    unusedvar : u256 ; 

    method inc : () :=              { counter >= 0 } 
        let counter = counter + 1 ; { counter > 0  }  
        counter 
    
    method get : u256 :=            { counter >= 0 } 
        counter                     { counter > 0  } 

    method reset : () :=            { counter >= 0 }  
        let x = 0; 
        let id x = x; 
        let counter = id x ;        { counter = 0  }  
        if x != 0 then 0 else counter  

} 

which is located here; pen/count.pen at 4eaeb8eafbb9def9d5fd4e646a25d71118595824 · ghasshee/pen · GitHub

The current development is concentrating on the procedure that transforms the AST into a program graph.
Once it is established, we can have a lot of good analytics derived from simple linear algebra.

The program graph might then be considered a “weighted pushdown automaton”.
“Weighted automata” are equivalent to linear algebra over a semiring (or rig).
See https://perso.telecom-paristech.fr/jsaka/ENSG/MPRI/Files/Lectures/FLAT-cplt-190123.pdf for an introduction to weighted automata.

“send” / “transfer” functions

For avoidance of “The DAO” incident,
all contracts COULD control their balance, as they know who sent what amount to them;
i.e. all pen contracts could have a table whose keys are the senders and whose values are the amounts.

So, why not have such a table inside each contract compiled by the pen compiler?
Then, in the language, the “send” function and the “transfer” function are distinct:
the “send” function sends some amount to a contract,
while the “transfer” function just changes the table which is inside the contract;
i.e. the “transfer” function does not send any value out of the contract to outside accounts.

Internally,
the “send(A, B, amount)” function checks whether the amount is less than the table value of account B if the sender A is a contract.
The “transfer(A, B, amount)” function first checks that the amount is less than the value of A in the table, then reduces the value of A by the amount, and then increases the table value of B by the amount.
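The transfer half of this can be sketched directly over a balance table (a sketch under the description above; the names and the use of a Maybe result are assumptions, not pen's actual runtime):

```haskell
import qualified Data.Map as M

type Addr    = String
type Balance = M.Map Addr Integer

-- "transfer" only rewrites the internal balance table; nothing
-- leaves the contract. Check A's balance, debit A, credit B.
transfer :: Addr -> Addr -> Integer -> Balance -> Maybe Balance
transfer a b amount tbl = do
  bal <- M.lookup a tbl
  if amount <= bal
    then Just (M.insertWith (+) b amount (M.adjust (subtract amount) a tbl))
    else Nothing   -- insufficient internal balance: reject

main :: IO ()
main = print (transfer "A" "B" 30 (M.fromList [("A", 100), ("B", 0)]))
-- Just (fromList [("A",70),("B",30)])
```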

Now we are supporting a primitive matrix representation of the program graph.
If we find the fixpoint when taking the star operation on the matrix, we will know all the program behavior.

We are not supporting all the language features yet, only a small subset.
However, it seems a good first step toward a “nice” compiler.
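A minimal sketch of the star operation on a representation matrix, over the boolean semiring (where star computes reachability). A real analysis would use richer semirings; this only shows the shape of the computation, and the example graph is made up:

```haskell
-- Reachability closure A* = I + A + A^2 + ... over the boolean
-- semiring, computed by Warshall's algorithm.
type Mat = [[Bool]]

star :: Mat -> Mat
star a0 = foldl update withI [0 .. n - 1]
  where
    n     = length a0
    withI = [[a0 !! i !! j || i == j | j <- [0 .. n-1]] | i <- [0 .. n-1]]
    update a k =
      [[ a !! i !! j || (a !! i !! k && a !! k !! j)
       | j <- [0 .. n-1]] | i <- [0 .. n-1]]

-- A small program graph: 1 -> 2, a 2 -> 2 self-loop, 2 -> 3, 3 -> 4.
graph :: Mat
graph = [ [False, True , False, False]
        , [False, True , True , False]
        , [False, False, False, True ]
        , [False, False, False, False] ]

main :: IO ()
main = print (star graph !! 0 !! 3)   -- True: node 4 reachable from node 1
```

Replacing Bool with a weighted semiring turns the same algorithm into the fixpoint computation described above.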

References :

(The latter one is an introduction to matrix representations over a non-commutative semiring, i.e. automata. See chapter 1.)

In addition to the above references, if we want to find the fixpoint, we should remove the ambiguity of the system configuration, i.e. the stack or memory (or storage) states. To achieve this, a primitive automaton is not a framework expressive enough to represent things properly while removing the ambiguity. The more proper system is the “Pushdown Automaton”, which is described in chapter 7 of the Handbook of Weighted Automata: Algebraic Systems and Pushdown Automata | SpringerLink.

Banning the Send (Transfer) function anywhere other than at the end of the Program Graph

The EVM is a machine which handles “money” transitions from account to account.
As described above, in the pen compiler, money transitions inside a single contract that handles many people’s balances should not use the “send” function; rather, they should use the “transfer” function, which only rewrites the balance table in storage.

When running a contract, “send”/“transfer” can be the finalization of the contract run.
A contract should be made between one account and another account, and therefore the finalization of money between them can happen once, at the end of an event.

Suppose there is a program graph compiled from Pen compiler source code;
O is some opcodes, M stands for a memory state change, C is a condition, and
S (or T) is a send function (or a transfer function). The representation matrix of the graph (a block matrix; i.e. O, M, C, and S are each blocks made of small matrices) should look like this,
where M* is the star operation on M, as in the context of regular expressions:

\   1   2   3   4
1       O
2       M*  C
3               S
4

In this graph, after we run the S operation, there is no loop, i.e. no M*. ( Each row and column represents a node number; Node 1 is the initial node and Node 4 is the last, or termination, node. )
We could impose on the compiler the constraint that S (or T) must be in the last part of the program graph, without loss of usability of a contract.

This constraint also catches, at compile time, the classical condition-check-order error as in TheDAO source code. (Another way of avoiding TheDAO was described above: by the separation of send/transfer and by managing balance tables.)

We might take the constraint to mean that we cannot write a contract that sends $100 to 100 randomly selected people, i.e.;

contract Foo() {
  method send100people100dollars(){
     for account in [1..100] 
         send 100 to Addr[account];   
  } 
} 

But this can be avoided by setting up a proper analysis of the for loop; i.e. each iteration of a for loop is accompanied by a state change. And in the context of representation matrices of Pushdown Automata, rather than those of plain Automata, the single loop is not a loop in the larger block matrix.

Thus, we should construct a “proper” for-loop, or functionals, in the compiler, such that its corresponding part in the program-graph analysis, the representation matrix, is not a loop, by making the matrix much larger, taking the primitive matrix as a block element of the larger matrix.

Synthetic Way and Analytic Way

Most of us, including researchers on other formal verification projects, rely on the “analytic” way of program verification. That is, program first, verification later. In this way, we write a program in a certain language, and afterwards we verify that the program satisfies the specification we want; the programming language places no restriction on which programs can occur with respect to the specification. On the other hand, in “synthetic” languages such as Coq, we specify the specification precisely, and first. Thanks to a generalization of the Curry-Howard correspondence, a program can be generated if and only if we have proved that the program specification is right.

Here we saw two opposite ways of program verification: analytic vs. synthetic.
I will not say which is better, because both ways have their own weak points.
e.g. the synthetic way can always yield a trivial program. (Even if we write specifications S_1, ... S_n, if we forget to add a specification S_{n+1} which we need, we can always obtain the trivial synthetic program. And the proving technique in a synthetic language such as Coq is sometimes much harder than formal verification by program analysis using approximation.)

Guard Condition

Concerning the Pen Compiler, we have currently defined the syntax tree, its program graph, and the representation matrix; that is, we have been developing the analytic side.
In particular, concerning the program graph, we have not decided how to handle loops in the graph.
One analytic way of loop verification is “invariant analysis”, which is studied in the theory of Hoare Logic. However, invariant analysis requires us to specify what the invariant is, and it might be difficult to find in some cases. Then how do we handle loops? For this question, we would like to apply a synthetic way to the pen compiler, called the “guard condition”.
The guard condition asserts that the loop branch condition has an inductive data structure, and each time we reach the condition, the structure must shrink; it must shrink down to just a “point structure”, breaking the loop.
For example, the Coq compiler uses the guard condition in order to assure that programs in Coq terminate, and it is analyzed statically.
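The guard condition in its simplest form: every recursive call happens on a structurally smaller argument, so the data shrinks toward the "point structure" Z and the loop must break. A minimal sketch on Peano naturals:

```haskell
-- Recursion guarded by structure: each call peels one S constructor.
data Nat = Z | S Nat

toInt :: Nat -> Int
toInt Z     = 0
toInt (S n) = 1 + toInt n    -- recursive call on the strictly smaller n

plus :: Nat -> Nat -> Nat
plus Z     m = m
plus (S n) m = S (plus n m)  -- again, n shrinks toward the "point" Z

main :: IO ()
main = print (toInt (plus (S (S Z)) (S Z)))   -- 3
```

Coq's termination checker accepts exactly this pattern (Fixpoint on a decreasing argument) and rejects recursion whose argument does not visibly shrink.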

Many compilers, such as Solidity’s, use Integers, but we might not need them.
Rather, we have Natural Numbers (and positive numbers and integers, as in the Coq style).
Negative numbers can be represented by natural numbers via double-entry bookkeeping.
We might not need integers in formal settings.

We sometimes say a loop branch condition is “noetherian” if it satisfies the guard condition, since the condition sequence is descending and has an end, just as a descending sequence in algebraic geometry is called noetherian if it has an end point, named after the great mathematician Emmy Noether.

Formal Opcodes

Under such assumptions, we cannot use the EVM’s ADD opcode directly when adding two numbers, since ADD works exactly as defined in the EVM.
We have inductive data structures, and we have to embed them into the EVM. Since the embedded EVM code runs when it is called in the EVM, static checking is not sufficient even if we check the code carefully at compile time. So, in embedding data structures, we have to define a Formal Addition FADD as addition with a boundary-condition check, FMUL as multiplication with a boundary-condition check, and so on. Of course, the gas prices of the Formal Opcodes should be cheap.
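A sketch of what FADD means semantically: addition with a boundary check, so overflow is rejected at run time instead of wrapping as the EVM's ADD does. The 256-bit bound and the Maybe result are assumptions for illustration:

```haskell
-- Hypothetical semantics of a "formal opcode" FADD.
bound :: Integer
bound = 2 ^ (256 :: Int)

fadd :: Integer -> Integer -> Maybe Integer
fadd a b
  | a + b < bound = Just (a + b)
  | otherwise     = Nothing     -- boundary violated: abort the run

main :: IO ()
main = do
  print (fadd 1 2)              -- Just 3
  print (fadd (bound - 1) 1)    -- Nothing
```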

MSO Logic and Formal Verification

First-Order Logic is a logic with first-order variables, i.e. variables x to which we assign values.
Monadic Second-Order (MSO) Logic is a first-order theory with limited use of second-order variables X .
MSO logics can represent the conditions that some sorts of structures, such as words or trees, need to satisfy. It is a formal logic of structures.
For brevity, one can think of a “formal lexer” and a “formal parser” built with them.

Automata Theories

Given a word, an automaton returns whether the word is accepted by it or not.
This is the well-known automaton, which is used in lexers.
There are some variants: weighted ones, transducers, infinite ones, tree automata, and so on.
In particular, given a tree, a tree automaton returns whether the tree is accepted by it or not.
In automata theory, there is a one-to-one correspondence between (word) automata and regular languages (regular expressions). Besides, there is another correspondence between tree automata and regular tree languages.
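A word automaton of the kind used in lexers can be sketched in a few lines (a toy example, not pen's Automata.hs): a deterministic automaton over the alphabet {a, b} accepting words with an even number of 'a's.

```haskell
-- A deterministic finite automaton: state 0 = even 'a's, state 1 = odd.
type State = Int

delta :: State -> Char -> State
delta 0 'a' = 1
delta 1 'a' = 0
delta q _   = q          -- 'b' leaves the state unchanged

accepts :: String -> Bool
accepts w = foldl delta 0 w == 0   -- state 0 is both initial and final

main :: IO ()
main = do
  print (accepts "abba")   -- True  (two 'a's)
  print (accepts "ab")     -- False (one 'a')
```

Running an automaton is just a fold of the transition function over the word; the weighted variants replace Bool acceptance with a semiring value.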

MSO Logic

Speaking of MSO theory, surprisingly, there is another one-to-one correspondence.
Given an MSO Logic formula, one can construct an automaton which accepts only the words satisfying the formula. Of course, there is a tree-MSO version.
In the blog, I explain the details of the conversion from MSO Logic to Automata more formally. See https://ghassheee.github.io/ethereum/510-MSO-Logic.md/.

Applications

Just as MSO-generated automata can decide whether a word satisfies the MSO Logic formula,
it is well known that Linear Temporal Logic (LTL) can generate automata deciding whether a program satisfies the LTL formula.
Various research on this logic-automata correspondence has been published.
See, e.g. https://janahofmann.github.io/files/dissertation_jana_hofmann.pdf

The Solver Architecture

For the sake of developing a solver architecture for logical problems, i.e. whether a program satisfies an assumption or a proposition, the pen compiler now has (in the coming months, will have) Automata.hs (AlternatingAutomata.hs or SomeAutomata.hs) and MSO.hs (LTL.hs or SomeLogic.hs), which can decide whether a word (a program) satisfies an MSO logic formula (the logical assumptions that the program must satisfy).


Update: I added the blog link which explains the MSO-Automata conversion formally.

EVM-Dafny

I have just learned about this project, which seems to have a quite clear and nice architecture.
EVM-Dafny enables us to be free from bugs by implementing the EVM semantics via the Dafny language.
The elaborate verification part is delegated to the Dafny language, and we can enjoy both synthetic verification and analytic verification when we write contracts using EVM-Dafny. I would like to try it.
Has anyone already tried it?

There are some posts about the EVM-Dafny language.
