A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Vector pushes only small files to S3, and the buffer and batch settings appear to have no effect.
Expectation: with 'batch.max_events' and 'batch.timeout_secs' set, I expect Vector to push a file to S3 once a batch reaches 'max_events' events, or when the timeout is reached.
What I actually see is that no matter what I configure, hundreds of files are pushed every minute, each 300-500 KB in size and containing 400-600 records. I tried increasing and decreasing both the batch and the buffer properties, but nothing changed.
We consume the data from a Kafka topic with 60 partitions. Could the batching depend on the Kafka partitioning?
Please see the config below. If this is not a bug but a configuration mistake on my side, please let me know.
Thanks for your help.
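For scale, a rough back-of-envelope calculation (using the midpoints of the observed ranges above, which are estimates, not measured values) of how large a full batch should be under the configured 'batch.max_events':

```python
# Back-of-envelope check: with the configured batch.max_events, files
# should be far larger than what is observed. Numbers are the midpoints
# of the ranges reported above (300-500 KB, 400-600 records per file).
observed_file_bytes = 400 * 1024          # ~400 KB per file
observed_records = 500                    # ~500 records per file
avg_record_bytes = observed_file_bytes / observed_records  # ~819 bytes/record

configured_max_events = 25_000            # batch.max_events from the sink config
expected_file_bytes = configured_max_events * avg_record_bytes

print(f"avg record size: {avg_record_bytes:.0f} B")
print(f"expected full-batch file: {expected_file_bytes / 1024 / 1024:.1f} MiB")
```

A full batch would come out to roughly 20 MiB, i.e. around 40-60x the size of the files actually being written, so batches are being flushed long before 'max_events' is reached.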
Configuration
data_dir = "/data/vector"
[sources.kafka_in_aiven]
type = "kafka"
topics = ["testTopic"]
bootstrap_servers = "some-bootstrap-servers"
group_id = "kafka-test-ingester"
auto_offset_reset = "latest"
tls.enabled = true
... security configs ...
decoding.codec = "json"
[transforms.timestamp_to_ingestion]
type = "remap"
inputs = ["kafka_in_aiven"]
source = '''
.timestamp = now()
'''
[sinks.s3]
type = "aws_s3"
inputs = [ "timestamp_to_ingestion" ]
encoding.codec = "json"
bucket = "some-bucket-path"
key_prefix = "data-path/year=%Y/month=%m/day=%d/hour=%H/"
auth.assume_role = "aws-arn"
filename_extension = "json.gz"
storage_class = "ONEZONE_IA"
batch.timeout_secs = 120
batch.max_events = 25000
buffer.max_events = 50000
framing.method = "newline_delimited"
Version
Running on Kubernetes with the Docker image timberio/vector:0.24.0-distroless-libc
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response