datafusion/.gitattributes
Chris Connelly 2587df09c3 Support newlines_in_values CSV option (#11533)
* feat!: support `newlines_in_values` CSV option

This significantly simplifies the UX when dealing with large CSV files
that must support newlines in (quoted) values. By default, large CSV
files will be repartitioned into multiple parallel range scans. This is
great for performance in the common case, but when a large CSV contains
newlines in values, the parallel scan fails because ranges are split on
newlines inside quotes rather than on actual line terminators.
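
To make the failure mode concrete, here is a minimal sketch using Python's standard `csv` module rather than DataFusion's own reader. The sample data and the helper `parse` are hypothetical, and the range boundary is placed by hand on the newline inside the quoted value; it only illustrates why a range that starts or ends inside a quoted field cannot be parsed independently.

```python
import csv
import io

# Hypothetical sample: the first record has a newline inside a quoted value.
data = 'id,note\n1,"first line\nsecond line"\n2,plain\n'


def parse(text):
    """Parse CSV text into a list of rows without translating line endings."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Parsing the whole input honours the quotes: header plus two records.
whole = parse(data)

# Simulate a range split that lands on the newline inside the quoted value.
cut = data.index("first line\n") + len("first line\n")
range_a, range_b = data[:cut], data[cut:]

# Each range is parsed on its own, as independent range scans would do.
split_rows = parse(range_a) + parse(range_b)

print(len(whole), "rows when parsed as one file")
print(len(split_rows), "rows when parsed as two independent ranges")
```

The first range ends with an unbalanced quote, so the two ranges together do not reproduce the rows of the whole-file parse.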

With the current implementation, this behaviour can be controlled by the
session-level `datafusion.optimizer.repartition_file_scans` and
`datafusion.optimizer.repartition_file_min_size` settings.

This commit introduces a `newlines_in_values` option to `CsvOptions` and
plumbs it through to `CsvExec`, which includes it in the test for whether
parallel execution is supported. This provides a convenient and
searchable way to disable file scan repartitioning on a per-CSV basis.

BREAKING CHANGE: This adds new public fields to types whose fields are
all public, which is a breaking change.

* docs: normalise `newlines_in_values` documentation

* test: add/fix sqllogictests for `newlines_in_values`

* docs: document `datafusion.catalog.newlines_in_values`

* fix: typo in config.md

* chore: suppress lint on too many arguments for `CsvExec::new`

* fix: always checkout `*.slt` with LF line endings

This is a bit of a stab in the dark, but it might fix multiline tests on
Windows.

* fix: always checkout `newlines_in_values.csv` with `LF` line endings

The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.
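
A minimal sketch of the suspected mechanism, again using Python's standard `csv` module as a stand-in (an assumption: DataFusion's reader and the sqllogictest runner are not involved here, and the sample data is hypothetical). It shows that rewriting LF as CRLF across the whole file changes the content of a quoted multi-line value, not just the record terminators.

```python
import csv
import io

# Hypothetical file content as committed, with LF line endings.
lf_text = 'id,note\n1,"a\nb"\n'

# What a checkout with CRLF conversion would produce: every LF is rewritten,
# including the one inside the quoted value.
crlf_text = lf_text.replace("\n", "\r\n")


def parse(text):
    """Parse CSV text into a list of rows without translating line endings."""
    return list(csv.reader(io.StringIO(text, newline="")))


lf_value = parse(lf_text)[1][1]
crlf_value = parse(crlf_text)[1][1]

print(repr(lf_value))
print(repr(crlf_value))
```

The record terminators are absorbed by the parser either way, but the quoted value itself now differs, which is consistent with pinning the file to `eol=lf`.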

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2024-07-21 19:42:03 -04:00


.github/ export-ignore
datafusion/core/tests/data/newlines_in_values.csv text eol=lf
datafusion/proto/src/generated/prost.rs linguist-generated
datafusion/proto/src/generated/pbjson.rs linguist-generated