Support write to buffer api for SerializedFileWriter (#7714)

# Which issue does this PR close?

Currently there is no public API for writing to the internal buffer of
`SerializedFileWriter`. Such an API is very helpful when adding low-level
functionality, for example:
- https://github.com/apache/datafusion/issues/16374
- https://github.com/apache/datafusion/pull/16395

We need the buffer's bytes-written count to stay up to date. If we write
through the buffer's inner writer directly, that count is not updated.

Keeping the bytes-written metric consistent is the key requirement for our
custom index writes.
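
To illustrate the problem, here is a minimal, self-contained sketch of a byte-counting wrapper. `CountingWriter` and `demo_divergence` are hypothetical names made up for this example; they are a simplified stand-in and not the parquet crate's `TrackedWrite`. The sketch only shows how a write that bypasses the counting layer leaves the recorded count behind the real output size.

```rust
use std::io::{self, Write};

/// Simplified, hypothetical stand-in for a byte-counting layer.
/// This is NOT the parquet crate's `TrackedWrite`.
struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: usize,
}

impl<W: Write> CountingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    /// Number of bytes that went through the counting layer.
    fn bytes_written(&self) -> usize {
        self.bytes_written
    }

    /// Direct access to the wrapped writer; writes made here are not counted.
    fn inner_mut(&mut self) -> &mut W {
        &mut self.inner
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.bytes_written += n; // count only what was actually written
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Performs one counted write and one bypassing write, then returns
/// (recorded byte count, actual length of the sink).
fn demo_divergence() -> io::Result<(usize, usize)> {
    let mut w = CountingWriter::new(Vec::new());
    w.write_all(b"PAR1")?; // counted: 4 bytes
    w.inner_mut().write_all(b"index")?; // bypasses the counter: 5 bytes
    let recorded = w.bytes_written();
    let actual = w.inner_mut().len();
    Ok((recorded, actual))
}
```

In this sketch the recorded count and the real sink length disagree after the bypassing write, which is the kind of mismatch that would make offsets derived from the count point at the wrong place.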


# Rationale for this change

Add an API that writes through the internal buffer so that its bytes-written count is updated.

# What changes are included in this PR?

Add a public `write_all` method that writes through the internal buffer and updates its bytes-written count.

# Are there any user-facing changes?
No

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
This commit is contained in:
Qi Zhu
2025-06-21 00:08:30 +08:00
committed by GitHub
parent 1bed04c1e0
commit fbaf7cea2d
2 changed files with 32 additions and 3 deletions
+14 -2
@@ -297,6 +297,14 @@ impl<W: Write + Send> ArrowWriter<W> {
Ok(())
}
/// Writes the given buf bytes to the internal buffer.
///
/// It's safe to use this method to write data to the underlying writer,
/// because it will ensure that the buffering and bytecounting layers are used.
pub fn write_all(&mut self, buf: &[u8]) -> std::io::Result<()> {
self.writer.write_all(buf)
}
/// Flushes all buffered rows into a new row group
pub fn flush(&mut self) -> Result<()> {
let in_progress = match self.in_progress.take() {
@@ -326,8 +334,12 @@ impl<W: Write + Send> ArrowWriter<W> {
/// Returns a mutable reference to the underlying writer.
///
/// It is inadvisable to directly write to the underlying writer, doing so
/// will likely result in a corrupt parquet file
/// **Warning**: if you write directly to this writer, you will skip
/// the `TrackedWrite` buffering and bytecounting layers. That'll cause
/// the file footer's recorded offsets and sizes to diverge from reality,
/// resulting in an unreadable or corrupted Parquet file.
///
/// If you want to write safely to the underlying writer, use [`Self::write_all`].
pub fn inner_mut(&mut self) -> &mut W {
self.writer.inner_mut()
}
+18 -1
@@ -394,9 +394,26 @@ impl<W: Write + Send> SerializedFileWriter<W> {
self.buf.inner()
}
/// Writes the given buf bytes to the internal buffer.
///
/// This can be used to write raw data to an in-progress parquet file, for
/// example, custom index structures or other payloads. Other parquet readers
/// will skip this data when reading the files.
///
/// It's safe to use this method to write data to the underlying writer,
/// because it will ensure that the buffering and bytecounting layers are used.
pub fn write_all(&mut self, buf: &[u8]) -> std::io::Result<()> {
self.buf.write_all(buf)
}
/// Returns a mutable reference to the underlying writer.
///
/// It is inadvisable to directly write to the underlying writer.
/// **Warning**: if you write directly to this writer, you will skip
/// the `TrackedWrite` buffering and bytecounting layers. That'll cause
/// the file footer's recorded offsets and sizes to diverge from reality,
/// resulting in an unreadable or corrupted Parquet file.
///
/// If you want to write safely to the underlying writer, use [`Self::write_all`].
pub fn inner_mut(&mut self) -> &mut W {
self.buf.inner_mut()
}
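
The doc comment above mentions custom index structures as a use case for `write_all`. The following is a rough, self-contained sketch of that idea under stated assumptions: `TrackedFile`, `append_index`, and `read_index` are hypothetical names invented for this example and do not exist in the parquet crate. It models a writer whose counted `write_all` lets the caller learn the offset at which a raw payload lands, so the payload can be located again later.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a file writer with a byte-counting layer.
/// It is not part of the parquet crate.
struct TrackedFile {
    sink: Vec<u8>,
    bytes_written: usize,
}

impl TrackedFile {
    fn new() -> Self {
        Self { sink: Vec::new(), bytes_written: 0 }
    }

    /// Counted write: every byte written here is reflected in `bytes_written`.
    fn write_all(&mut self, buf: &[u8]) -> io::Result<()> {
        self.sink.write_all(buf)?;
        self.bytes_written += buf.len();
        Ok(())
    }
}

/// Appends a raw payload (for example a custom index) and returns the
/// (offset, length) at which it was stored.
fn append_index(file: &mut TrackedFile, payload: &[u8]) -> io::Result<(usize, usize)> {
    // Because all earlier writes were counted, this is the true file offset.
    let offset = file.bytes_written;
    file.write_all(payload)?;
    Ok((offset, payload.len()))
}

/// Reads the payload back using the recorded offset and length.
fn read_index(file: &TrackedFile, offset: usize, len: usize) -> &[u8] {
    &file.sink[offset..offset + len]
}
```

The recorded offset is only trustworthy when every write goes through the counted path, which is why a counted `write_all` is preferable to writing via `inner_mut`.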