mirror of
https://github.com/langchain-ai/arrow-rs.git
synced 2026-07-01 21:34:01 -04:00
Support write to buffer api for SerializedFileWriter (#7714)
# Which issue does this PR close?

Currently there is no public API for writing to the internal buffer of `SerializedFileWriter`. Such an API is very helpful when adding low-level functionality, for example:

- https://github.com/apache/datafusion/issues/16374
- https://github.com/apache/datafusion/pull/16395

We need the buffer's bytes-written count to be updated on every write. If we write through the buffer's inner writer directly, the internal bytes-written count is not updated. Keeping the bytes-written metric consistent is the key to our custom index write.

# Rationale for this change

Add an API that writes through the buffer so that its bytes-written count is updated.

# What changes are included in this PR?

Add an API that writes through the buffer so that its bytes-written count is updated.

# Are there any user-facing changes?

No

If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@@ -297,6 +297,14 @@ impl<W: Write + Send> ArrowWriter<W> {
         Ok(())
     }
 
+    /// Writes the given buf bytes to the internal buffer.
+    ///
+    /// It's safe to use this method to write data to the underlying writer,
+    /// because it will ensure that the buffering and byte‐counting layers are used.
+    pub fn write_all(&mut self, buf: &[u8]) -> std::io::Result<()> {
+        self.writer.write_all(buf)
+    }
+
     /// Flushes all buffered rows into a new row group
     pub fn flush(&mut self) -> Result<()> {
         let in_progress = match self.in_progress.take() {
@@ -326,8 +334,12 @@ impl<W: Write + Send> ArrowWriter<W> {
 
     /// Returns a mutable reference to the underlying writer.
     ///
-    /// It is inadvisable to directly write to the underlying writer, doing so
-    /// will likely result in a corrupt parquet file
+    /// **Warning**: if you write directly to this writer, you will skip
+    /// the `TrackedWrite` buffering and byte‐counting layers. That’ll cause
+    /// the file footer’s recorded offsets and sizes to diverge from reality,
+    /// resulting in an unreadable or corrupted Parquet file.
+    ///
+    /// If you want to write safely to the underlying writer, use [`Self::write_all`].
     pub fn inner_mut(&mut self) -> &mut W {
         self.writer.inner_mut()
     }
@@ -394,9 +394,26 @@ impl<W: Write + Send> SerializedFileWriter<W> {
         self.buf.inner()
     }
 
+    /// Writes the given buf bytes to the internal buffer.
+    ///
+    /// This can be used to write raw data to an in-progress parquet file, for
+    /// example, custom index structures or other payloads. Other parquet readers
+    /// will skip this data when reading the files.
+    ///
+    /// It's safe to use this method to write data to the underlying writer,
+    /// because it will ensure that the buffering and byte‐counting layers are used.
+    pub fn write_all(&mut self, buf: &[u8]) -> std::io::Result<()> {
+        self.buf.write_all(buf)
+    }
+
     /// Returns a mutable reference to the underlying writer.
     ///
-    /// It is inadvisable to directly write to the underlying writer.
+    /// **Warning**: if you write directly to this writer, you will skip
+    /// the `TrackedWrite` buffering and byte‐counting layers. That’ll cause
+    /// the file footer’s recorded offsets and sizes to diverge from reality,
+    /// resulting in an unreadable or corrupted Parquet file.
+    ///
+    /// If you want to write safely to the underlying writer, use [`Self::write_all`].
     pub fn inner_mut(&mut self) -> &mut W {
         self.buf.inner_mut()
     }
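
The doc comments above warn that writes made through `inner_mut` skip the byte-counting layer, while `write_all` goes through it. A minimal sketch of why that matters, using only the standard library: `CountingWriter` and `demo` are hypothetical names invented for this illustration and are a simplified stand-in, not the parquet crate's actual `TrackedWrite` implementation.

```rust
use std::io::{self, Write};

/// Simplified stand-in for a byte-counting writer layer.
/// Hypothetical: this is NOT the parquet crate's `TrackedWrite`.
struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: usize,
}

impl<W: Write> CountingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    /// Number of bytes this layer believes have been written so far.
    fn bytes_written(&self) -> usize {
        self.bytes_written
    }

    /// Escape hatch: writes made through this reference are not counted.
    fn inner_mut(&mut self) -> &mut W {
        &mut self.inner
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        // Every write that passes through this layer updates the counter.
        self.bytes_written += n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Performs one counted write and one uncounted write, then returns
/// (bytes the counter recorded, bytes actually present in the sink).
fn demo() -> io::Result<(usize, usize)> {
    let mut w = CountingWriter::new(Vec::new());
    w.write_all(b"custom-index")?; // 12 bytes, goes through the counter
    w.inner_mut().write_all(b"raw")?; // 3 bytes, bypasses the counter
    let tracked = w.bytes_written();
    let actual = w.inner_mut().len();
    Ok((tracked, actual))
}
```

Calling `demo()` yields a recorded count of 12 against 15 bytes actually in the sink. In a writer that derives footer offsets from the recorded count, a gap like this is the kind of divergence the `inner_mut` warning describes, which is presumably why the new `write_all` methods route through the buffered writer instead.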