wezm.net/v2/content/posts/2023/divmod.md

+++
title = "divmod, Rust, x86, and Optimisation"
date = 2023-01-11T19:48:09+10:00

#[extra]
#updated = 2022-04-21T09:07:57+10:00
+++

While reviewing some Rust code that did something like this:

```rust
let a = n / d;
let b = n % d;
```

I lamented the lack of a `divmod` method in Rust (that would return both the
quotient and remainder). My colleague [Brendan] pointed out that he actually
[added it][rust-div-mod] back in 2013 but it was moved out of the standard
library before the 1.0 release.

<!-- more -->

I also learned that the [`div` instruction on x86][div] provides the remainder
so there is potentially some benefit to combining the operation. I suspected
that LLVM was probably able to optimise the separate operations and a trip to
[the Compiler Explorer][compiler-explorer] confirmed it.

This function:

```rust
pub fn divmod(n: usize, d: usize) -> (usize, usize) {
    (n / d, n % d)
}
```

Compiles to the following assembly, which I have annotated with my
understanding of each line (Note: I'm still learning x86 assembly):

```asm
; rdi = numerator, rsi = denominator
example::divmod:
        test    rsi, rsi     ; check for denominator of zero
        je      .LBB0_5      ; jump to div zero panic if zero
        mov     rax, rdi     ; load rax with numerator
        or      rax, rsi     ; or rax with denominator
        shr     rax, 32      ; shift rax right 32-bits
        je      .LBB0_2      ; if the result of the shift sets the zero flag then numerator and
                             ; denominator are 32-bit since none of the upper 32-bits are set.
                             ; jump to 32-bit division implementation
        mov     rax, rdi     ; move numerator into rax
        xor     edx, edx     ; zero edx (I'm not sure why, might be relevant to the calling
                             ; convention and is used by the caller?)
        div     rsi          ; divide rax by rsi
        ret                  ; return, quotient is in rax, remainder in rdx

; 32 bit implementation
.LBB0_2:
        mov     eax, edi     ; move edi to eax (32-bit regs)
        xor     edx, edx     ; zero edx
        div     esi          ; divide eax by esi
        ret

; div zero panic
.LBB0_5:
        push    rax
        lea     rdi, [rip + str.0]
        lea     rdx, [rip + .L__unnamed_1]
        mov     esi, 25
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        ud2

.L__unnamed_2:
        .ascii  "/app/example.rs"

.L__unnamed_1:
        .quad   .L__unnamed_2
        .asciz  "\017\000\000\000\000\000\000\000\002\000\000\000\006\000\000"

str.0:
        .ascii  "attempt to divide by zero"
```

I found it interesting that after checking for a zero denominator there's an
additional check to see if the values fit into 32-bits, and if so it jumps to an
instruction sequence that uses 32-bit registers. According to [the testing done
in this report][timing] 32-bit `div` has lower latency—particularly on older
models.

I wasn't able to work out why each implementation zeros `edx`. If you know,
send me a message and I'll update the post.

[View the Example on Compiler Explorer](https://rust.godbolt.org/z/hj9rb4Txa)

[Brendan]: https://github.com/brendanzab
[rust-div-mod]: https://github.com/rust-lang/rust/commit/f39152e07baf03fc1ff4c8b2c1678ac857b4a512
[div]: https://www.felixcloutier.com/x86/div
[compiler-explorer]: https://rust.godbolt.org/
[timing]: https://gmplib.org/~tege/x86-timing.pdf
Add divmod post 2023-01-11 10:32:56 +00:00			`+++`
			`title = "divmod, Rust, x86, and Optimisation"`
			`date = 2023-01-11T19:48:09+10:00`

			`#[extra]`
			`#updated = 2022-04-21T09:07:57+10:00`
			`+++`

			`While reviewing some Rust code that did something like this:`

			```rust
			`let a = n / d;`
			`let b = n % d;`
			```

			I lamented the lack of a `divmod` method in Rust (that would return both the
			`quotient and remainder). My colleague [Brendan] pointed out that he actually`
			`[added it][rust-div-mod] back in 2013 but it was moved out of the standard`
			`library before the 1.0 release.`

			`<!-- more -->`

			I also learned that the [`div` instruction on x86][div] provides the remainder
			`so there is potentially some benefit to combining the operation. I suspected`
			`that LLVM was probably able to optimise the separate operations and a trip to`
			`[the Compiler Explorer][compiler-explorer] confirmed it.`

			`This function:`

			```rust
			`pub fn divmod(n: usize, d: usize) -> (usize, usize) {`
			`(n / d, n % d)`
			`}`
			```

			`Compiles to the following assembly, which I have annotated with my`
			`understanding of each line (Note: I'm still learning x86 assembly):`

			```asm
			`; rdi = numerator, rsi = denominator`
			`example::divmod:`
			`test rsi, rsi ; check for denominator of zero`
			`je .LBB0_5 ; jump to div zero panic if zero`
			`mov rax, rdi ; load rax with numerator`
			`or rax, rsi ; or rax with denominator`
			`shr rax, 32 ; shift rax right 32-bits`
			`je .LBB0_2 ; if the result of the shift sets the zero flag then numerator and`
			`; denominator are 32-bit since none of the upper 32-bits are set.`
			`; jump to 32-bit division implementation`
			`mov rax, rdi ; move numerator into rax`
			`xor edx, edx ; zero edx (I'm not sure why, might be relevant to the calling`
			`; convention and is used by the caller?)`
			`div rsi ; divide rax by rsi`
			`ret ; return, quotient is in rax, remainder in rdx`

			`; 32 bit implementation`
			`.LBB0_2:`
			`mov eax, edi ; move edi to eax (32-bit regs)`
			`xor edx, edx ; zero edx`
			`div esi ; divide eax by esi`
			`ret`

			`; div zero panic`
			`.LBB0_5:`
			`push rax`
			`lea rdi, [rip + str.0]`
			`lea rdx, [rip + .L__unnamed_1]`
			`mov esi, 25`
			`call qword ptr [rip + core::panicking::panic@GOTPCREL]`
			`ud2`

			`.L__unnamed_2:`
			`.ascii "/app/example.rs"`

			`.L__unnamed_1:`
			`.quad .L__unnamed_2`
			`.asciz "\017\000\000\000\000\000\000\000\002\000\000\000\006\000\000"`

			`str.0:`
			`.ascii "attempt to divide by zero"`
			```

			`I found it interesting that after checking for a zero denominator there's an`
			`additional check to see if the values fit into 32-bits, and if so it jumps to an`
			`instruction sequence that uses 32-bit registers. According to [the testing done`
			in this report][timing] 32-bit `div` has lower latency—particularly on older
			`models.`

			I wasn't able to work out why each implementation zeros `edx`. If you know,
			`send me a message and I'll update the post.`

			`[View the Example on Compiler Explorer](https://rust.godbolt.org/z/hj9rb4Txa)`

			`[Brendan]: https://github.com/brendanzab`
			`[rust-div-mod]: https://github.com/rust-lang/rust/commit/f39152e07baf03fc1ff4c8b2c1678ac857b4a512`
			`[div]: https://www.felixcloutier.com/x86/div`
			`[compiler-explorer]: https://rust.godbolt.org/`
			`[timing]: https://gmplib.org/~tege/x86-timing.pdf`