fix(disk buffer): recover from decode errors during initialization seek#25691

Open
apurvanisal5 wants to merge 1 commit into vectordotdev:master from apurvanisal5:log-9468-buffer-decode-recovery

@apurvanisal5

Problem

When a disk buffer (v2) contains a record that fails protobuf decoding on restart,
seek_to_next_record() returns an error immediately during buffer initialization,
because ReaderError::Decode is not classified as a "bad read".

Customer-visible symptom:

error occurred when building buffer failed to seek to position where reader left off failed to decoded record: InvalidProtobufPayload

This can cause collectors to crash-loop until the buffer directory is manually deleted.

Solution

Add ReaderError::Decode to is_bad_read() so decode failures during
initialization seek follow the same recovery path as checksum, deserialization,
and partial_write errors.
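
A minimal sketch of the classification change, assuming a simplified `ReaderError`. The variant payloads and the `Io` variant are placeholders for illustration; only the variant names `Checksum`, `Deserialization`, and `Decode` appear in this PR's diff, and `PartialWrite` is inferred from the description above.

```rust
/// Simplified stand-in for the disk buffer's reader error type.
/// Field names and the `Io` variant are hypothetical.
#[derive(Debug)]
enum ReaderError {
    Io { reason: String },
    Checksum { expected: u32, actual: u32 },
    Deserialization { reason: String },
    PartialWrite,
    Decode { reason: String },
}

impl ReaderError {
    /// Returns true for errors the reader treats as a recoverable "bad read".
    fn is_bad_read(&self) -> bool {
        matches!(
            self,
            ReaderError::Checksum { .. }
                | ReaderError::Deserialization { .. }
                | ReaderError::PartialWrite
                // The variant this PR proposes to add to the set.
                | ReaderError::Decode { .. }
        )
    }
}
```

With this classification, a `Decode` error is handled wherever `is_bad_read()` is consulted, not only in the initialization seek.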

Test

  • Added reader_recovers_from_decode_error_during_initialization_seek
  • cargo test -p vector-buffers disk_v2 — all pass

@apurvanisal5 apurvanisal5 requested a review from a team as a code owner June 28, 2026 21:18
@github-actions

github-actions Bot commented Jun 28, 2026


All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02dc8b9b1b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

self,
ReaderError::Checksum { .. }
| ReaderError::Deserialization { .. }
| ReaderError::Decode { .. }

P1: Keep decode failures from rolling the data file

When a record's payload fails to decode but its length, archive, and checksum are valid, only that record has been consumed and the reader can continue to the next length-delimited record while gap accounting handles the missing ID. Adding Decode to is_bad_read() also affects the normal BufferReader::next() path, which calls roll_to_next_data_file() for any bad read; a single undecodable record in the middle of the current file therefore abandons later valid, unacknowledged records in that same file (the new initialization test creates this layout but only checks that reopen succeeds). This turns the startup recovery into avoidable event loss or a stalled reader for buffers with mixed bad/good records in one data file.
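
The concern can be illustrated with a toy model, which is not Vector's reader. It assumes each data file is a list of framed records and compares two policies for an undecodable record: abandoning the rest of the file (the bad-read path described in this comment) versus skipping only that record. `Record` and `delivered_ids` are hypothetical names.

```rust
/// Outcome of reading one framed record in this toy model.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Record {
    /// Decodable record carrying its ID.
    Good(u64),
    /// Length and checksum are valid, but the payload fails to decode.
    Undecodable,
}

/// Returns the IDs delivered to the consumer across all files.
/// When `roll_on_decode_error` is true, an undecodable record abandons
/// the remainder of the current file, mimicking a roll to the next file.
fn delivered_ids(files: &[Vec<Record>], roll_on_decode_error: bool) -> Vec<u64> {
    let mut out = Vec::new();
    for file in files {
        for record in file {
            match record {
                Record::Good(id) => out.push(*id),
                // Roll policy: stop reading this file entirely.
                Record::Undecodable if roll_on_decode_error => break,
                // Skip policy: drop only the undecodable record.
                Record::Undecodable => continue,
            }
        }
    }
    out
}
```

For files `[Good(1), Undecodable, Good(3)]` and `[Good(4)]`, the roll policy delivers IDs 1 and 4, while the skip policy also delivers ID 3.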

@apurvanisal5


I have read the CLA Document and I hereby sign the CLA

@apurvanisal5


recheck

Add ReaderError::Decode to is_bad_read() so seek_to_next_record during
buffer startup skips decode failures like checksum and partial_write errors,
instead of failing topology build with InvalidProtobufPayload.
@apurvanisal5 apurvanisal5 force-pushed the log-9468-buffer-decode-recovery branch from 02dc8b9 to c7d57e2 Compare June 28, 2026 21:33

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7d57e284f

self,
ReaderError::Checksum { .. }
| ReaderError::Deserialization { .. }
| ReaderError::Decode { .. }

P1: Don't consume the next record after a seek-time decode error

When a decode error occurs during seek_to_next_record, next() has already consumed that framed record but returns before track_read advances last_reader_record_id (the decode happens at reader.read_record(token)?). Classifying Decode as a bad read makes the seek loop swallow the error and call next() again; with the new test's layout (bad record #1, good record #2, ledger_last = 1), the loop then reads and drops record #2 during initialization even though it was the first unacknowledged record, so the buffer reopens with user data silently skipped.
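
The scenario in this comment can be reproduced with a toy reader, sketched here under simplifying assumptions. `ToyReader`, `Framed`, and `next_record` are hypothetical stand-ins (with `next_record` playing the role of `next()`), and the seek loop's shape is inferred from the description rather than copied from Vector's source.

```rust
#[derive(Clone, Copy)]
enum Framed {
    Good(u64),
    Undecodable,
}

struct ToyReader {
    records: Vec<Framed>,
    cursor: usize,
    last_reader_record_id: u64,
}

impl ToyReader {
    /// Consumes one framed record. On a decode failure the cursor has
    /// already advanced, but `last_reader_record_id` is left unchanged.
    fn next_record(&mut self) -> Result<Option<u64>, &'static str> {
        let framed = match self.records.get(self.cursor) {
            Some(framed) => *framed,
            None => return Ok(None),
        };
        self.cursor += 1;
        match framed {
            Framed::Good(id) => {
                self.last_reader_record_id = id;
                Ok(Some(id))
            }
            Framed::Undecodable => Err("decode"),
        }
    }

    /// Seek loop that swallows decode errors and keeps reading until the
    /// reader's last record ID catches up with the ledger's.
    fn seek_to_next_record(&mut self, ledger_last: u64) {
        while self.last_reader_record_id < ledger_last {
            match self.next_record() {
                Ok(Some(_)) | Err(_) => continue,
                Ok(None) => break,
            }
        }
    }
}
```

With records `[Undecodable, Good(2)]` and `ledger_last = 1`, the loop swallows the decode error, sees the ID still at 0, and reads record #2 before stopping, so a subsequent `next_record()` finds nothing left to deliver.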

@apurvanisal5


I have read the CLA Document and I hereby sign the CLA

@apurvanisal5


recheck
