AI-Powered Rewrite from Python to Rust - part 4
May 12, 2026
In the previous part, we implemented the LogParser in Rust using PyO3 and maturin.
All 74 contract tests passed - 37 for Python, 37 for Rust.
We were confident that both implementations behave the same way.
Or so we thought.
This article is part 4 of the series AI-Powered Rewrite from Python to Rust
- Part 1 - Ensure existing code is well tested
- Part 2 - Rewrite existing tests to contract tests and add a stub Rust wrapper
- Part 3 - Build the Rust implementation
- Part 4 - Use property-based testing to ensure the same output from Rust and Python for any given input
You can find the repository with all examples here
The limits of tests with hardcoded examples
Our contract tests are thorough. They cover all timestamp formats, all log levels, nested fields, noise lines, field types, and edge cases. That's 37 test cases exercising the parser from many angles.
But there's a catch.
Every single one of those tests uses ASCII-only input.
Service names like auth, error messages like timeout, field values like 42 and 99.99.
We wrote tests for the cases we could think of.
What about the ones we didn't?
This is where property-based testing with hypothesis comes in. Instead of writing specific test cases, we describe the shape of valid inputs and let the framework generate hundreds of random examples. If there's a mismatch between the two implementations on any generated input, we'll find it.
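To make the idea concrete, here is a minimal, self-contained Hypothesis property (a sketch, not code from the repository): we state an invariant over a whole class of inputs and let the framework hunt for a counterexample.

```python
from hypothesis import given, strategies as st

# Property: UTF-8 encoding round-trips losslessly for any text
# (surrogates excluded, for the same reason they are excluded later).
@given(st.text(alphabet=st.characters(blacklist_categories=("Cs",))))
def test_utf8_round_trip(s):
    assert s.encode("utf-8").decode("utf-8") == s

test_utf8_round_trip()  # runs 100 random examples by default
```

Calling the decorated function directly is enough to execute all generated examples; under pytest, Hypothesis hooks into the test runner the same way.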
The prompt
Let's instruct our AI agent:
Add Hypothesis property-based testing to verify that the Python LogParser and RustLogParser produce identical results for any input. Add hypothesis to the dev dependencies in pyproject.toml.
Create Hypothesis strategies in test_log_parser.py that can generate:
- Random timestamps in all 4 supported formats with arbitrary dates
- Random log levels with/without brackets
- Random field key=value pairs covering all types (integers, floats, booleans, unquoted strings, quoted strings with Unicode)
- Complete valid log lines (timestamp + level + fields)
- Noise lines (empty, whitespace, -- ... -- separators)
- Full log files mixing valid lines, noise, and garbage text
Exclude surrogate characters (Cs Unicode category) from all text strategies since they can't be written to files.
Add a TestImplementationsAgree class (separate from the contract classes) with two tests, 500 examples each:
1. test_both_implementations_agree -- generates a multi-line log file, parses with both implementations, asserts every entry matches (timestamp, level, fields, raw)
2. test_valid_log_lines_agree -- generates a single valid log line, asserts both implementations produce identical output
If any Hypothesis failure reveals a bug in the Rust implementation, fix it and re-run until all tests pass. After rebuilding Rust with uv run maturin develop, make sure the new .so is installed correctly before running tests (use uv pip install -e . --force-reinstall --no-build-isolation if uv run pytest picks up a stale binary).
Note the last paragraph. We're explicitly telling the AI to fix any bugs it finds. This is important because we expect Hypothesis to find things our initial tests missed.
Hypothesis strategies
The AI generates a set of composable strategies that can produce any valid (and invalid) log parser input:
from datetime import datetime
from hypothesis import given, settings, strategies as st
LEVELS = ["INFO", "ERROR", "WARN", "DEBUG", "TRACE", "FATAL"]
st_timestamp = st.one_of(
    st.builds(
        lambda dt: dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond:06d}Z",
        st.datetimes(
            min_value=datetime(1970, 1, 1),
            max_value=datetime(2099, 12, 31),
        ),
    ),
    st.builds(
        lambda dt: dt.strftime("%Y-%m-%dT%H:%M:%SZ"),
        st.datetimes(
            min_value=datetime(1970, 1, 1),
            max_value=datetime(2099, 12, 31),
        ),
    ),
    st.builds(
        lambda dt: dt.strftime("%Y-%m-%d %H:%M:%S"),
        st.datetimes(
            min_value=datetime(1970, 1, 1),
            max_value=datetime(2099, 12, 31),
        ),
    ),
    st.builds(
        lambda dt: dt.strftime("%Y/%m/%d %H:%M:%S"),
        st.datetimes(
            min_value=datetime(1970, 1, 1),
            max_value=datetime(2099, 12, 31),
        ),
    ),
)
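The four branches differ only in the format string, so they could also be generated by a small helper. A possible refactor (a sketch, not the repository code; `%f` renders microseconds zero-padded to six digits, matching the manual formatting above):

```python
from datetime import datetime
from hypothesis import strategies as st

def st_formatted_datetime(fmt: str):
    """Strategy producing datetimes rendered with the given strftime format."""
    return st.builds(
        lambda dt: dt.strftime(fmt),
        st.datetimes(
            min_value=datetime(1970, 1, 1),
            max_value=datetime(2099, 12, 31),
        ),
    )

# Drop-in equivalent of the four-branch version above.
st_timestamp = st.one_of(
    st_formatted_datetime("%Y-%m-%dT%H:%M:%S.%fZ"),
    st_formatted_datetime("%Y-%m-%dT%H:%M:%SZ"),
    st_formatted_datetime("%Y-%m-%d %H:%M:%S"),
    st_formatted_datetime("%Y/%m/%d %H:%M:%S"),
)
```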
Each strategy generates one of the four supported timestamp formats with random dates. The same approach is used for levels, fields, and complete lines:
st_level = st.one_of(
    st.builds(lambda l: f"[{l}]", st.sampled_from(LEVELS)),
    st.sampled_from(LEVELS),
)
st_field_key = st.from_regex(r"[a-z][a-z0-9_]{0,15}", fullmatch=True)
st_field_value = st.one_of(
    st.integers(min_value=-999999, max_value=999999).map(str),
    st.floats(
        min_value=-1e6,
        max_value=1e6,
        allow_nan=False,
        allow_infinity=False,
    ).filter(lambda f: f != int(f)).map(lambda f: f"{f:.2f}"),
    st.sampled_from(["true", "false"]),
    st.text(
        alphabet=st.characters(
            whitelist_categories=("L", "N", "P"),
            blacklist_characters='"\\= \t\n{}',
        ),
        min_size=1,
        max_size=20,
    ),
    st.text(
        alphabet=st.characters(
            whitelist_categories=("L", "N", "P", "Z"),
            blacklist_characters='"\\\n',
        ),
        min_size=1,
        max_size=30,
    ).map(lambda s: f'"{s}"'),
)
st_field = st.builds(lambda k, v: f"{k}={v}", st_field_key, st_field_value)
st_fields = st.lists(st_field, min_size=0, max_size=5).map(lambda fs: " ".join(fs))
Notice the st_field_value strategy.
It generates integers, floats, booleans, unquoted strings, and Unicode-quoted strings.
This is the key detail - Unicode characters will exercise code paths our ASCII-only contract tests never touched.
The strategies compose into full log files:
st_log_line = st.builds(
    lambda ts, level, fields: f"{ts} {level} {fields}".rstrip(),
    st_timestamp,
    st_level,
    st_fields,
)

st_noise_line = st.one_of(
    st.just(""),
    st.just("   "),
    st.text(
        alphabet=st.characters(blacklist_characters="\n", blacklist_categories=("Cs",)),
        min_size=1,
        max_size=30,
    ).map(lambda s: f"-- {s} --"),
)

st_any_line = st.one_of(
    st_log_line,
    st_noise_line,
    st.text(
        alphabet=st.characters(
            whitelist_categories=("L",),
            blacklist_categories=("Cs",),
        ),
        min_size=1,
        max_size=20,
    ),
)
st_log_file = st.lists(st_any_line, min_size=1, max_size=15)
All text strategies exclude surrogate characters (Cs Unicode category) since those can't be written to files as UTF-8.
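The exclusion is not cosmetic: Python happily stores a lone surrogate in a str, but refuses to encode it as UTF-8, which is exactly what writing the generated log file to disk requires:

```python
# A lone surrogate (Unicode category Cs) is a legal str code point...
s = "\ud800"

# ...but it cannot be serialized as UTF-8, so writing it to a file fails.
try:
    s.encode("utf-8")
except UnicodeEncodeError as exc:
    print(f"{type(exc).__name__}: {exc}")
```

Without the Cs exclusion, Hypothesis would eventually generate such a character and the test would fail while writing the temp file, before either parser even runs.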
The property tests
The tests themselves are straightforward.
They sit in a separate TestImplementationsAgree class - outside the contract hierarchy - because they aren't testing a single implementation against expected values.
They're testing that two implementations agree with each other on arbitrary input:
class TestImplementationsAgree:
    @given(lines=st_log_file)
    @settings(max_examples=500)
    def test_both_implementations_agree(self, lines):
        path = _write_log_file(lines)
        try:
            py_entries = LogParser().load(path)
            rs_entries = RustLogParser().load(path)
            assert _entries_equal(py_entries, rs_entries), (
                f"Mismatch for input:\n{lines}\n"
                f"Python: {[(e.timestamp, e.level, e.fields) for e in py_entries]}\n"
                f"Rust: {[(e.timestamp, e.level, e.fields) for e in rs_entries]}"
            )
        finally:
            os.unlink(path)

    @given(line=st_log_line)
    @settings(max_examples=500)
    def test_valid_log_lines_agree(self, line):
        path = _write_log_file([line])
        try:
            py_entries = LogParser().load(path)
            rs_entries = RustLogParser().load(path)
            assert _entries_equal(py_entries, rs_entries), (
                f"Mismatch for line: {line!r}\n"
                f"Python: {[(e.timestamp, e.level, e.fields) for e in py_entries]}\n"
                f"Rust: {[(e.timestamp, e.level, e.fields) for e in rs_entries]}"
            )
        finally:
            os.unlink(path)
The core property: for any input, Python and Rust produce identical results.
Each test runs 500 randomly generated examples. That's 1000 inputs our parser has never seen before.
You can read more about property-based testing in the article Testing in Python.
The bug Hypothesis found
And this is where it gets interesting. Running the tests, Hypothesis immediately finds a bug in the Rust implementation.
If you look back at the parse_fields code from Part 3, you'll spot the problem.
The function converted the input string to a Vec<char> (where index = character position) but then used &str methods like .find('=') which return byte positions:
let chars: Vec<char> = line.chars().collect();
// ...
let eq_pos = line[i..].find('='); // byte position!
let key: String = chars[i..eq_pos].iter().collect(); // char position!
With ASCII-only input, byte positions and character positions are identical. Every ASCII character is exactly 1 byte. So all 37 contract tests passed.
But with multibyte Unicode characters (e.g., Ħ, which is 2 bytes, or 𨼄, which is 4 bytes), the positions diverge.
.find('=') might return byte position 30, but character position 30 in the chars vector points to a completely different location.
This caused either:
- A panic, when Rust's string slicing caught the invalid boundary:
byte index 30 is not a char boundary; it is inside '𨼄' (bytes 27..31)
- Silent wrong results: slicing at the wrong position, extracting the wrong key or value
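The byte-versus-character divergence is easy to reproduce from Python, where str.find works in characters while bytes.find works in bytes (a sketch, not the actual failing input):

```python
line = "Ħkey=value"  # 'Ħ' is 1 character but 2 bytes in UTF-8

char_pos = line.find("=")                   # character index: 4
byte_pos = line.encode("utf-8").find(b"=")  # byte index: 5

# Using the byte index against a character list lands one position
# too far -- the same mismatch the Rust parse_fields suffered from.
assert char_pos == 4 and byte_pos == 5
assert list(line)[char_pos] == "="
assert list(line)[byte_pos] != "="
```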
The fix
The AI replaced all &str-based searching in parse_fields with a find_char helper that operates purely on the Vec<char>:
fn find_char(chars: &[char], start: usize, target: char) -> Option<usize> {
    for (offset, &ch) in chars[start..].iter().enumerate() {
        if ch == target {
            return Some(start + offset);
        }
    }
    None
}
Every place that previously did line[i..].find('=') (byte-indexed &str search) now does find_char(&chars, i, '=') (char-indexed Vec<char> search).
This way, all indexing is consistent - i, eq_pos, start are always character positions into the chars vector, and strings are reconstructed with chars[start..end].iter().collect().
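A quick Python mirror of the fixed logic shows why consistent character indexing is safe for any input (split_field is a hypothetical stand-in for illustration, not the parser's real API):

```python
def split_field(field: str) -> tuple[str, str]:
    # Everything here is a character index, mirroring the fixed Rust code:
    # search, slice, and reconstruction all use the same char list.
    chars = list(field)
    eq_pos = chars.index("=")  # char-indexed search, like find_char
    return "".join(chars[:eq_pos]), "".join(chars[eq_pos + 1:])

# Multibyte characters on either side of '=' no longer shift positions.
print(split_field("msg=Ħello𨼄"))  # ('msg', 'Ħello𨼄')
```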
After the fix, all tests pass - including the 1000 Hypothesis-generated examples.
Why the contract tests didn't catch it
The 37 contract tests only use ASCII log lines. English service names, numeric IDs, and simple error messages. ASCII characters are always 1 byte = 1 char, so byte and char positions never diverge.
It took Hypothesis generating field values like x_Çb³5ŗĝŚ¡Z³² (multibyte characters mixed with ASCII) to trigger the mismatch.
This is exactly why property-based testing is valuable for a Python-to-Rust rewrite. Tests with hardcoded examples verify the paths you have in mind; Hypothesis explores the edges you don't, including the most obscure cases.
Conclusion
This wraps up our series on rewriting Python to Rust with AI.
Let's recap the approach:
- Part 1: We expanded test coverage so we actually know what behavior to preserve
- Part 2: We restructured tests into contracts so both implementations share the same spec
- Part 3: We implemented the Rust version and verified it against that spec
- Part 4: We used property-based testing to find a real bug that initial tests missed
The key takeaway: tests are the backbone of any rewrite. Without them, you're comparing code by eye. With comprehensive, shared, and randomized tests, you get strong, automated evidence that both implementations behave identically.
All the techniques and approaches that we're using in this series are explained in detail inside Complete Python Testing Guide.
Until next time, happy engineering!