Performance Research and Optimization on CPython’s Interpreter

Huaxiong Cao, Naijie Gu, Kaixin Ren, and Yi Li
1) Department of Computer Science and Technology, University of Science and Technology of China
2) Anhui Province Key Laboratory of Computing and Communication Software
3) Institute of Advanced Technology, University of Science and Technology of China
Hefei, China, 230027
Email: chx319@mail.ustc.edu.cn, gunj@ustc.edu.cn

Abstract—In this paper, the performance research on CPython’s latest interpreter is presented, concluding that bytecode dispatching takes about 25 percent of total execution time on average. Based on this observation, a novel bytecode dispatching mechanism is proposed to reduce the time spent on this phase to a minimum. With this mechanism, the blocks associated with each kind of bytecodes are rewritten in hand-tuned assembly, their opcodes are renumbered, and their memory spaces are rescheduled. With these preparations, this new bytecode dispatching mechanism replaces the time-consuming memory reading operations with rapid operations on registers.

This mechanism is implemented in CPython-3.3.0. Experiments on lots of benchmarks demonstrate its correctness and efficiency. The comparison between original CPython and optimized CPython shows that this new mechanism achieves about 8.5 percent performance improvement on average. For some particular benchmarks, the maximum improvement is up to 18 percentages.

I. INTRODUCTION

The past decade has witnessed the widespread use of Python, which is a typical dynamic language designed to execute on virtual machines. Several software engineering advantages over statically compiled binaries, including portable program representations, thread manage -ment, some safety guarantees, built-in automatic memory, and dynamic program composition through dynamic class loading, are provided by Python. These advanced features enhance the user programming model and drive the success of Python language. However, these features also require the dynamic compilers to do quite a lot of extra operations, such as type checking, wrapping/unwrapping of boxed values, virtual method dispatching, bytecode dispatching, etc. These extra operations usually make Python programs several or dozens of times slower than static programs achieving the same functionality. Moreover, traditional static program optimization technologies are frustrated, introducing new challenges for achieving high performance.

In response, more and more researchers have paid their attention to optimize dynamic language compilers. The technologies proposed aim to improve performance by monitoring programs’ behavior and using this information to drive optimization decisions [1]. The dominant concepts that have influenced effective optimization technologies in today’s virtual machines include JIT compilers, interpreters, and their integrations.

JIT (just-in-time) techniques exploit the well-known fact that large scale programs usually spend the majority of time on a small fraction of the code [2]. During the execution of interpreters, they record the bytecode blocks which have been executed more than a specified number of times, and cache the binary code associated to these blocks. The next time these bytecode blocks are executed, it has no need to interpret these bytecodes once again, just jumps to the cached binary code, and continues the execution. By this way, much interpreting work is omitted, and performance improvement can be achieved. However, since the threshold is quite large in general, the interpreting stage still accounts for a large proportion of total execution time. JIT strategies work well for forwhile blocks which likely exist in science compute field. For the programs, whose purposes are remote configuration, warning, tracing, statistics, communication, etc., it is very hard to find hot blocks. Though not large, these programs are usually executed quite frequently.

Interpreters have a series of advantages which make them attractive [3]. Firstly, optimizing interpreters reduces the uptime of all Python programs with or without hot blocks. Secondly, they are quite simpler to construct than JIT compilers, making them quicker, more reliable to construct and easier to maintain. Thirdly, interpreters require less memory than JIT compilers, for both the interpreted virtual machine code and the interpreter itself. Interpreters for dynamic language have two main structures [1]: switch-based structure and threaded-based structure, whose details are given in section 2 part B. The latter is the latest and most efficient mechanism. Recently, many researchers have applied dynamic techniques to improve the performance of threaded-based structure. Piumarta and Riccardi [4] describe their techniques to dynamically generate threaded codes for purpose of eliminating a central dispatch site and inlining common bytecode sequences. Ertl and Gregg [5] extend Piumarta and Riccardi’s work by duplicating bytecode sequences, researching various interpreter generation heuristics.
and concentrating on improving branch prediction accuracy. Gagnon and Hendren [6] adapt Piumarta and Riccardi’s research to work in the context of multithreading and dynamic class loading. Sullivan et al. [7] describe a combination between an interpreter implementation and a dynamic binary optimizer, which enhances the efficacy of the underlying dynamic binary optimizer during the execution of interpreter.

Despite the enhancements above, interpreters are still worse than static compilers from the runtime perspective. The purpose of our research is to explore new ways to achieve further performance improvement. In this paper, the performance research on CPython’s latest interpreter is presented, finding that bytecode dispatching takes about 25 percent of total execution time on average. Then, a novel bytecode dispatching mechanism, which aims at reducing memory reading operations during bytecode dispatching and reducing the time spent on this phase to a minimum, is proposed. This mechanism is implemented inside CPython’s interpreter. Experiments on lots of benchmarks demonstrate the correctness and efficiency of our new mechanism. The comparisons between original CPython and optimized CPython show that our new mechanism can achieve about 8.5 percent performance improvement on average. For some particular benchmarks, the maximum improvement is up to 18 percentages.

The remainder of this paper is structured as follows: In the next section, we make comparisons among different compilers of Python and present both switch-based and threaded-based mechanisms. Section 3 reports our performance research on CPython’s interpreter. Section 4 discusses the main techniques we adopted to construct a new bytecode dispatching mechanism. Benchmarks and Experiments are given in section 5. Section 6 discusses the related work. We conclude this paper in the last section.

II. BACKGROUND
A. CPython VS Other Python compilers

A series of dynamic compilers are designed to run Python programs, such as CPython, Jython [8], IronPython [9], Pyston [10], PyPy [11], etc. The comparisons among them are shown on Table I, and these benchmarks are provided officially by Pyston/minibenchmarks and Pyston/microbenchmarks. Among these compilers, CPython is the official and standard compiler for Python language, it can support all the grammars and extensions. Others are developed for special applications. They focus on particular scenarios and take in-depth optimization passes. In this context, these compilers show their excellent performances for some benchmarks, but for other benchmarks, they may be several times slower than CPython. As shown on Table I, taking fid.py for example, it takes CPython 3.696 seconds to execute this script, while IronPython, Pyston and PyPy spend less than 2 seconds on the same script. However, it takes CPython 1.082 seconds to execute emwomding.py, while JPython, IronPython, Pyston and PyPy spend more than 3.700 seconds on the execution of emwomding.py.

What’s more, some benchmarks, like pydigits.py, empth_lo-op.py, vecf_add.py, ng.py, raytrace.py, cannot be executed successfully by some of these compilers. Based on these comparisons, our research focuses on CPython, and has not only practical application value, but also instruction meaning for the improvement of other interpreters.

B. Switch-based Mechanism VS Thread-based Mechanism

The performance of interpreters depends heavily on their bytecode dispatching mechanisms. CPython’s interpreter provides two main bytecode dispatching mechanisms: switch-based mechanism and thread-based mechanism.

The inner loop of switch-based interpreters is quite simple: jump to the dispatch point, fetch the next bytecode and dispatch to its implementation through a switch statement. Its typical framework is shown in Fig. 1.

As shown in Fig. 1, the interpreter is an infinite loop with a big switch block to dispatch bytecodes successively. Each bytecode are implemented by a particular case in the body of this switch block. At the end of each case, control is passed back to the beginning of the infinite loop by breaking out of this switch block. Traditional C compilers like GCC translate this switch block into a series of comparison statements. Considering a particular bytecode whose opcode is BINARY_SUBTRACT, five indispensable comparisons should be executed before reaching its associated block. Assuming that all bytecodes have the same probability of occurrence, it takes N/2 (N is the total kinds of bytecodes) comparisons on average to find the corresponding block. As to

<table>
<thead>
<tr>
<th>Benchmarks</th>
<th>CPython-3.3.0</th>
<th>CPython</th>
<th>IronPython</th>
<th>Pyston</th>
<th>PyPy</th>
</tr>
</thead>
<tbody>
<tr>
<td>empty_loop.py</td>
<td>3.532s</td>
<td>5.491s</td>
<td>Failed</td>
<td>19.248s</td>
<td>0.248s</td>
</tr>
<tr>
<td>pydigits.py</td>
<td>0.034s</td>
<td>2.120s</td>
<td>1.431s</td>
<td>Failed</td>
<td>0.039s</td>
</tr>
<tr>
<td>tif.py</td>
<td>3.696s</td>
<td>4.776s</td>
<td>1.527s</td>
<td>0.636s</td>
<td>0.864s</td>
</tr>
<tr>
<td>allgroup.py</td>
<td>9.890s</td>
<td>12.970s</td>
<td>25.150s</td>
<td>Failed</td>
<td>0.059s</td>
</tr>
<tr>
<td>chaso.py</td>
<td>9.890s</td>
<td>12.970s</td>
<td>25.150s</td>
<td>Failed</td>
<td>0.059s</td>
</tr>
<tr>
<td>go.py</td>
<td>26.268s</td>
<td>33.091s</td>
<td>51.368s</td>
<td>Failed</td>
<td>1.392s</td>
</tr>
<tr>
<td>rbody.py</td>
<td>53.787s</td>
<td>56.746s</td>
<td>123.404s</td>
<td>Failed</td>
<td>33.638s</td>
</tr>
<tr>
<td>ng.py</td>
<td>12.776s</td>
<td>19.535s</td>
<td>17.057s</td>
<td>25.540s</td>
<td>1.470s</td>
</tr>
<tr>
<td>nq.py</td>
<td>50.897s</td>
<td>12.776s</td>
<td>17.057s</td>
<td>25.540s</td>
<td>1.470s</td>
</tr>
<tr>
<td>raytrace.py</td>
<td>11.608s</td>
<td>16.418s</td>
<td>26.724s</td>
<td>Failed</td>
<td>44.418s</td>
</tr>
<tr>
<td>polymorphism.py</td>
<td>4.358s</td>
<td>7.223s</td>
<td>7.959s</td>
<td>4.391s</td>
<td>14.260s</td>
</tr>
<tr>
<td>unwinding.py</td>
<td>1.082s</td>
<td>3.757s</td>
<td>2.802s</td>
<td>93.180s</td>
<td>4.481s</td>
</tr>
</tbody>
</table>
CPython, N is 101. So, performing bytecode dispatching wastes the majority of execution time and it is very inefficient.

Threaded-based mechanism is the latest and most efficient mechanism for interpreters, popularized by the Forth programming language [12]. There are many kinds of threaded-based interpreters, and direct threading is regarded as the most efficient one. Direct threading mechanism improves performance by eliminating redundant comparisons. In addition, rather than returning to a central dispatch point, the implementation of each direct threading opcode ends with the particular code required to dispatch the next opcode. This optimization eliminates the centralized dispatch, removing lots of jump instructions. The framework of direct threading mechanism is shown in Fig. 2.

As shown in Fig. 2, execution starts with fetching the address of the very first bytecode’s implementation and then jumping to that address. For each bytecode, it performs its own work at first, then increases the instruction pointer, thirdly fetches the address of next bytecode’s implementation from memory, and jumps to the target address to handle successive bytecode. The native instructions associated with bytecode dispatching is shown in Fig. 3, with one stack reading, one memory reading and one jump. It can be seen from Fig. 3 that the bytecode dispatching overheads associated with direct threading mechanism are quite lower than those associated with switch-based mechanism.

III. MOTIVATION: PERFORMANCE RESEARCH

So far, it can be seen that bytecode dispatching plays a very important role in the performance of CPython. To make this concept intuitional, a series of meticulous experiments are conducted and the experimental results are shown in table II.

To make the results credible and accurate, this work utilizes hardware performance counters provided by Intel, and calculates the number of ticks which are consumed by the procedure to find next bytecode. The results are shown in Table II column 2. The ticks, which are consumed by the total program execution are also recorded and listed in Table II column 3. In Table II column 4, the ratios between them are calculated and listed. According to this table, it can be sure that the time consumed by bytecode dispatching is 25 percent of the total execution time on average. So if this stage could be further optimized, the total performance of CPython will be improved obviously.

As shown in Fig. 3, the bytecode dispatching in direct threading mechanism is composed of one stack reading, one jump and one memory reading. Inside the memory reading, there are an addition and a multiplication. Since operations on stack are quite frequent, the first stack reading instruction will also hit the L1 cache. Table III [13] lists the time needed to read data from register, L1 cache, L2 cache, memory and disc. According to this table, the first instruction takes about 2ns. The second instruction contains an addition operation, a multiplication operation and a load operation. The first two operations can be finished within several ticks. However, since L1 cache can only contain eight pages (32k L1 cache, page size 4k), reading the address of successive bytecode’s implementation leads to lots of L1 cache misses. In this context, it has to get the right value from L2 cache, or even memory, which may take dozens of nanoseconds. Assuming that reading from these three memory structures has the same probability of occurrence, it takes about 30ns to finish this memory reading. So, it can be seen that the second instruction takes the majority of the time spent on bytecode dispatching, and optimizing this instruction will contribute a lot to bytecode dispatching.

compiled code:
void * table[] = { ...
  ##LOAD_FAST, 
  ##LOAD_CONST, 
  ##BINARY_ADD, 
  ##BINARY_SUBTRACT, ... }; 
bytecode implementations:
For( ; ; ) {
  insn = get_next_insn(insn);
  opcode = get_opcode(insn);
  switch(opcode) {
  case NOP: ...; break;
  case LOAD_FAST: ...; break;
  case LOAD_CONST: ...; break;
  case BINARY_ADD: ...; break;
  case BINARY_SUBTRACT: ...;
  break;
  ...
  }
}

Fig. 1 The framework of switch structure.

compiled code:
Unsigned char code[] = { ...
  LOAD_FAST,
  LOAD_CONST,
  BINARY_ADD,
  BINARY_SUBTRACT, ... }; 
Bytecode implementations:
For( ; ; ) {
  insn = get_next_insn(insn);
  opcode = get_opcode(insn);
  switch(opcode) {
  case NOP: ...; break;
  case LOAD_FAST: ...; break;
  case LOAD_CONST: ...; break;
  case BINARY_ADD: ...; break;
  case BINARY_SUBTRACT: ...;
  break;
  }
}

Fig. 2 The framework of direct threading mechanism.
Phase 2: calculation. This phase calculates the proper size of each memory unit, named BSIZE. Inside the optimized interpreters, the memory allocated to each kind of bytecodes is an integral multiple of BSIZE bytes. Mark $\text{binary}[i]$ as the length of binary code associated with $i$th kind of bytecodes, the BSIZE should be the minimum positive number which conforms to condition (1) and condition (2).

$$\exists i: 2^i = \text{BSIZE}$$

$$\sum_{i=0}^{100} \frac{\text{binary}[i]}{\text{BSIZE}} \leq 256$$

The first constraint assures the calculation of next bytecode’s address is simplified to a quick left shift. The second constraint assures that the maximum opcode of bytecodes isn’t bigger than the threshold value (256) defined by CPython. If the BSIZE is too big, there will be a lot of NOP instructions and the executable file will be quite big, causing damage to the performance. That’s why BSIZE should be set as small as possible. According to Table IV, the proper value of BSIZE is 128.

Phase 3: opcode redefinition. This phase redefines the opcodes of bytecodes, with their order stay the same. Let $\text{opcode}[i]$ be the new opcode of $i$th kind of bytecode, and the algorithm used here is shown as follows:

$$\text{opcode}[0] = 0$$

$$\text{opcode}[i] = \text{opcode}[i-1] + \left\lfloor \frac{\text{binary}[i-1]}{\text{BSIZE}} \right\rfloor, 1 \leq i \leq 100$$

Taking the first three bytecode (POP_TOP, ROT_TWO and ROT_THREE) as an example, their original opcodes are 1, 2 and 3, respectively. Assuming the first bytecode has 200 byte binary code, the second bytecode has 350 byte binary code, and the third bytecode has 100 byte binary code, their final opcodes will be 1, 3 and 6. The memory allocation of the interpreter with new dispatching mechanism is shown in Fig. 4. As shown in Fig. 4, POP_TOP’s binary code is stored in the first grid region (from top to bottom), ROT_TWO’s binary code is stored in the second grid region, while the ROT_THREE’s binary code is stored in the third grid region. There are gaps between neighboring kinds of bytecodes. In addition, the framework of the new interpreter of CPython is shown in Fig. 5. Inside this new interpreter, the binary code of all kinds of bytecodes is arranged in numerical order and each of them occupies several BSIZE memory spaces. So, every time CPython jumps to the implementation associated to next bytecode, it just need to execute this simple statement: “goto (base + BSIZE * opcode)”. Fig. 6 shows the native instructions associated with this statement, including one stack reading, one left shift, one addition and one jump. The middle two instructions are operations on registers and can be finish within
2 ticks (0.8ns). Hence, it just takes 2.8ns to get the address of next bytecode’s implementation, instead of 30ns. Since NEXTOP operation is executed so many times, this new structure will bring a lot of performance promotion.

bytecode implementations:
BASE:
...;
asm(".balign BSIZE \n\t");
LOAD_FAST:
...;
opcode = get_opcode(insn_next);
go to (BASE+opcode* BSIZE);
asm(".balign BSIZE \n\t");
LOAD_CONST:
...;
opcode = get_opcode(insn_next);
go to (BASE+opcode* BSIZE);
asm(".balign BSIZE \n\t");
BINARY_ADD:
...;
opcode = get_opcode(insn_next);
go to (BASE+opcode* BSIZE);
asm(".balign BSIZE \n\t");
BINARY_SUBTRACT:
...;
opcode = get_opcode(insn_next);
go to (BASE+opcode* BSIZE);

//read stack and get the opcode of next bytecode
mov 0x1c4(%ebp),%eax
shl $8, %eax //left shift
//BASE is an immediate value
addl $BASE, %eax
//jump to deal with next bytecode
jmp *%eax

Fig. 6 The native instructions associated with new bytecode dispatching.
remaining memory reading operations hit the cache. In another word, the remaining memory reading operations can be finished in less time.

VI. RELATED WORK

There are a large quantity of recent papers researching interpreter performance. Romer et al. [16] have reported the performance characteristics of some interpreters. Later, Ertl and Gregg [3] investigated the performance of recent efficient interpreters. Both of these two studies have found that almost every interpreters perform an exceptionally high number of indirect branches. Since most of indirect branches are caused by bytecode dispatching, their conclusion is consistent with this performance research reporting in section 3. Our research aims at reducing the time spent on the second instruction in Fig. 3 to a minimum, while their work aims at optimizing the third instruction in Fig. 3 and reducing indirect branches mispredictions. This is the main difference between us.

Several techniques are used to reduce indirect branches mispredictions. J Hoogerbrugge and L Augusteijn [17], [18] have proposed that software pipelining interpreters is a way to reduce dispatch branch cost on architectures with split indirect branches. In addition, Subroutine threading [19] has also been proposed to avoid the overheads of indirect branches in interpreter implementations. Each bytecode is implemented with a particular C function. Instead of dispatching or interpreting bytecode, a simple JIT compiler generates executable code for a sequence of calls to these functions. This method can eliminate indirect branches at the cost of sacrificing both simplicity and portability.

Cache misses have a significant impact on the program performance [20]. Brunthaler [21] have proposed a formalization of interpreter opcode ordering (bytecode scheduling) for an interpreter with an extended opcode set, and concluded that high cache miss ratio is another bottleneck of interpreters. A lot of techniques, like feedback-guided technique [22], profile-guided technique [23], etc, are conducted to achieve better orderings, improving code locality and reducing cache misses. Jason McCandless and his co-worker [24] implement a metaheuristic (Monte Carlo) to generate better orderings, achieving considerable performance improvement. As shown in section 5, cache misses can also be reduced by our new mechanism, making the new interpreter perform better.

Another method which is widely used is combining sequences of VM instructions into super-instructions [25]. This technique focuses on reducing the number of bytecode dispatches, and has two variants: static super-instructions and dynamic super-instructions. The comparison between these two variants are shown in [5].

As far as we know, the nearest research to our work is [14]. The main difference is that their research plans for each bytecode a fixed-size block of instructions. In such a case, bytecodes with short instruction implementation will incur a lot of NOP instructions, which increases the code size of the interpreter dispatch loop and reduce cache hit ratio. In addition, bytecodes with long instruction implementation will lead to lots of extra jump instructions. Relatively speaking, our mechanism is more flexible and efficient than that, with much less NOP instructions and no extra jump.

VII. CONCLUSIONS

In this paper, the performance research on CPython’s interpreter is carried out, figuring out that bytecode dispatching has a big influence on interpreters. Then, a novel bytecode dispatching mechanism is designed, aiming at removing memory reading operations during bytecode dispatching and...
improvement. Interpreters, and will contribute to their performance
mechanism proposed here can also be adopted by other
reading operations and lesser cache misses. The novel
improvement is up to 18 percentages. This performance
average. For some particular benchmarks, the maximum
achieves about 8.5 percent performance improvement on
spaces.
order and each of them occupies several BSIZE memory
binary code of each kind of bytecodes is arranged in numerical
reducing the time spent on this phase to a minimum. The final
binary code of each kind of bytecodes is arranged in numerical
and optimized CPython to show that the new mechanism

REFERENCES
adaptive optimization in virtual machines,” Proc. of IEEE, vol. 93, no. 2,
http://dx.doi.org/10.1002/spe.4380010203
selective inlining,” ACM Sigplan Not., vol. 35, no. 5, pages 291-300, May,
accuracy in virtual machine interpreters,” ACM Sigplan Not., vol. 38, no. 5,
Java bytecode using preparation sequences,” in Compiler Construction,
G. Goos, J. Hartmanis and J. v. Leeuwen, Ed., Heidelberg, DE: Springer,
“Dynamic native optimization of interpreters,” in Proc. 2003 workshop on
50-57. http://dx.doi.org/10.1145/858570.858576
http://dx.doi.org/10.1007/978-1-4842-0055-1_12
FORTH_Resources/CM_ForthLanguageInteractiveComputing_1970.pdf
[16] T. H. Romer, D. Lee, G. M. Voelker, A. Wolman, W. A. Wong, J.-L. Baer,
B. N. Bershad, and H. M. Levy, “The structure and performance of
http://dx.doi.org/10.1145/246209.237175
3-540-46423-9_3
compression system based on pipelined interpreters,” Softw.: Practice
http://dx.doi.org/10.1002/spe.1097-024x(199909)29:11<1005::aid-spe
705>3.0.co;2-f
A flexible and efficient dispatch technique for virtual machine
http://dx.doi.org/10.1109/cgo.2005.14
[20] Ristov, and Sako, “Performance impact of reconfigurable L1 cache on
GPU devices,” Computer Science and Information Systems (FedCSIS),
507-510.
http://dx.doi.org/10.1007/978-3-642-19661-8_10
optimization,” in Proc. 2005 Int. Conf. Parallel Process. Workshops,
93548.93550
(2005). Optimizations for a java interpreter using instruction set
tech-reports/reports/05/TCD-CS-2005-01.pdf.