
Delineate internal metadata fields with a boolean identifier #295

@binarylogic

Description

This is the successor to #256

There is a complex matrix of possible behaviors when you introduce structured vs non-structured data types across different topologies. This has been discussed at length.

The purpose of this issue is to materialize the conclusion we came to today in Slack so that we can move forward with a number of other dependent issues, such as specifying the hostname key in various sinks, or specifying a target key when parsing a log message, etc.

Quick Background

  1. The issue was originally raised via [RFC] Drop the host & line keys in Record struct #155
  2. The "Structured vs Non-Structured" RFC was created to address this: https://www.notion.so/timber/Raw-vs-Structured-Data-for-Sinks-f4a73d250d88427ca677ea0954e3ea75
  3. This was implemented via Refactor Record and use bytes instead of String #204, which introduced a performance regression in the TCP source.
  4. Perf improvements #269 resolved this performance regression by moving host back to a defined Record field.
  5. This change caused a regression around how we handle the host key in the Splunk sink (or any sink that required host): Make splunk to use record host field #276
  6. Which then led to this comment Make splunk to use record host field #276 (comment)
  7. Which in turn led to another long Slack discussion.

Solution

The solution today, which was proposed by @michaelfairley, is to delineate fields that were implicitly and explicitly set with a boolean. In other words, to change the structured map from string => string to string => (bool, string). This means we have one single map representing all structured data, with a simple boolean telling us if it was implicitly or explicitly set.
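A minimal Rust sketch of the proposed shape (illustrative names, not Vector's actual types): the structured map becomes key => (value, explicit-flag), and a record counts as "unstructured" when every flag is false.

```rust
use std::collections::HashMap;

// Hypothetical sketch: every value carries a flag recording whether it was
// explicitly set (by a user or transform) or implicitly set (by a source).
type Record = HashMap<String, (String, bool)>; // key => (value, explicit)

// A record counts as "unstructured" when every field was implicitly set.
fn is_unstructured(record: &Record) -> bool {
    record.values().all(|v| !v.1)
}

fn main() {
    let mut record = Record::new();
    record.insert("message".to_string(), ("Hello world".to_string(), false));
    record.insert("host".to_string(), ("my.host.com".to_string(), false));
    assert!(is_unstructured(&record));

    // A transform adding an explicit field flips the record to "structured".
    record.insert("msg".to_string(), ("Hello world".to_string(), true));
    assert!(!is_unstructured(&record));
}
```

The single-map approach means sinks never need a separate "structured vs raw" record variant; they only inspect the flags.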

Examples

TCP -> TCP

[sources.in]
  type = "tcp"
  # ...

[sinks.out]
  inputs = ["in"]
  type = "tcp"
  # ...
  1. Input: "Hello world" raw text line.

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false),
      "message" => ("Hello world", false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. tcp (out) writes:

    Hello world\n
    

    Only the raw "message" field is written to the TCP sink. This is because the record is recognized as "unstructured", since all of its values are false (implicitly set). This is the default behavior for unstructured data in the TCP sink.
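The sink-side decision described above could look roughly like this (a sketch assuming the (value, bool) map from the proposal; `encode_line` is a hypothetical helper, not Vector's API):

```rust
use std::collections::HashMap;

// Hypothetical record shape from the proposal: key => (value, explicit).
type Record = HashMap<String, (String, bool)>;

// If every field was implicitly set, write only the raw "message" line;
// otherwise the sink would defer to a configured encoder (not shown here).
fn encode_line(record: &Record) -> Option<String> {
    let all_implicit = record.values().all(|v| !v.1);
    if all_implicit {
        record.get("message").map(|v| format!("{}\n", v.0))
    } else {
        None
    }
}

fn main() {
    let mut record = Record::new();
    record.insert("message".to_string(), ("Hello world".to_string(), false));
    record.insert("host".to_string(), ("my.host.com".to_string(), false));
    assert_eq!(encode_line(&record), Some("Hello world\n".to_string()));
}
```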

TCP -> JSON Parser -> TCP

[sources.in]
  type = "tcp"
  # ...

[transforms.json]
  inputs = ["in"]
  type = "parser"
  format = "json"
  # ...

[sinks.out]
  inputs = ["json"]
  type = "tcp"
  encoder = "json" # required
  # ...
  1. Input: '{"msg": "Hello world", "key": "val"}'

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false),
      "message" => ('{"msg": "Hello world", "key": "val"}', false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. transform (json parser) transforms the data into:

    {
      "timestamp" => (<timestamp>, false),
      "message" => ('{"msg": "Hello world", "key": "val"}', false),
      "host" => ("my.host.com", false),
      "msg" => ("Hello world", true),
      "key" => ("val", true)
    }
    
  4. tcp (out) writes:

    {"msg": "Hello world", "key": "val"}
    

    You'll notice the tcp.out declaration includes a required encoder option since it is receiving structured data. This will be handled via Implement topology-aware validations #235. You'll also notice that metadata fields are not included by default. This is because these are internal/transparent fields that are only used when necessary or explicitly included (happy to hear arguments otherwise).
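The "explicit fields only" encoding behavior could be sketched as follows (naive hand-rolled JSON with no string escaping, purely illustrative; `encode_json` is a hypothetical name):

```rust
use std::collections::HashMap;

// Hypothetical sketch: serialize only explicitly set fields, so implicit
// metadata ("timestamp", "host", the raw "message") stays internal unless
// the user opts in.
type Record = HashMap<String, (String, bool)>; // key => (value, explicit)

fn encode_json(record: &Record) -> String {
    // Keep only fields whose explicit flag is true.
    let mut fields: Vec<_> = record.iter().filter(|(_, v)| v.1).collect();
    // Sort for deterministic output; HashMap iteration order is arbitrary.
    fields.sort_by(|a, b| a.0.cmp(b.0));
    let body: Vec<String> = fields
        .iter()
        // Naive quoting, no escaping; fine for a sketch only.
        .map(|(k, v)| format!("\"{}\": \"{}\"", k, v.0))
        .collect();
    format!("{{{}}}", body.join(", "))
}

fn main() {
    let mut record = Record::new();
    record.insert("host".to_string(), ("my.host.com".to_string(), false));
    record.insert("msg".to_string(), ("Hello world".to_string(), true));
    record.insert("key".to_string(), ("val".to_string(), true));
    assert_eq!(encode_json(&record), "{\"key\": \"val\", \"msg\": \"Hello world\"}");
}
```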

TCP -> Splunk

[sources.in]
  type = "tcp"
  # ...

[sinks.out]
  inputs = ["in"]
  type = "splunk"
  1. Input: "Hello world" raw text line.

  2. tcp (in) receives data and represents it as:

    {
      "timestamp" => (<timestamp>, false),
      "message" => ("Hello world", false),
      "host" => ("my.host.com", false)
    }
    

    Where false means that the data was implicitly set.

  3. splunk (out) forwards the "message" but also specifies the host, since Splunk requires this metadata. By default, the sink looks for the "host" key, since this is one of our "common" keys, but the user can override that by setting the host_field option in the sinks.out declaration.
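The host lookup described above might be sketched like this (`resolve_host` and the `"server"` override key are hypothetical, for illustration only):

```rust
use std::collections::HashMap;

// Hypothetical record shape from the proposal: key => (value, explicit).
type Record = HashMap<String, (String, bool)>;

// Resolve the host from a configurable key (host_field), defaulting to the
// common "host" key when the user has not overridden it.
fn resolve_host<'a>(record: &'a Record, host_field: Option<&str>) -> Option<&'a str> {
    let key = host_field.unwrap_or("host");
    record.get(key).map(|v| v.0.as_str())
}

fn main() {
    let mut record = Record::new();
    record.insert("host".to_string(), ("my.host.com".to_string(), false));
    assert_eq!(resolve_host(&record, None), Some("my.host.com"));

    // A hypothetical override: host_field = "server" in the sink config.
    record.insert("server".to_string(), ("other.host".to_string(), true));
    assert_eq!(resolve_host(&record, Some("server")), Some("other.host"));
}
```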

Requirements Checklist

This is a checklist to ensure we're handling all of the little details that come with this change. If it helps, we can break these out into separate issues, because I would assume they'll be separate PRs.

  • Decide on reserved field names. Ex: timestamp, message, and host. Alternatively, we could namespace the keys like _timestamp, _message, and _host. Feel free to go with these or choose entirely different names; I'm indifferent.

Sources

  • file source adds a timestamp field to represent when the record was received
  • file source includes the host key for the local server.
  • file source includes a file_key config option to control the "file" context key name.
  • file source includes a host_key config option to control the "host" context key name.
  • The above file source behavior is tested.
  • syslog source adds a timestamp field to represent when the record was received
  • syslog source includes the host key when in tcp mode, this represents the host of the client.
  • syslog source includes the host key when in unix mode, this should be the local host.
  • syslog source includes a host_key config option to control the "host" context key name.
  • The above syslog source behavior is tested.
  • stdin source adds a timestamp field to represent when the record was received
  • stdin source includes the host key for the local server.
  • stdin source includes a host_key config option to control the "host" context key name.
  • The above stdin source behavior is tested.
  • tcp source adds a timestamp field to represent when the record was received
  • tcp source includes the host key for the remote server.
  • tcp source includes a host_key config option to control the "host" context key name.
  • The above tcp source behavior is tested.

Transforms

  • add_field transform sets any added fields as explicit.
  • json_parser transform decodes data and sets all decoded fields as explicit.
  • regex_parser transform sets any extracted fields as explicit.
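The transform-side rule shared by all three items above (any field a transform writes is explicit) could be sketched as (`add_field` here is illustrative, not the transform's real implementation):

```rust
use std::collections::HashMap;

// Hypothetical record shape from the proposal: key => (value, explicit).
type Record = HashMap<String, (String, bool)>;

// Any field written by a transform is marked explicit (true), which flips
// the record from "unstructured" to "structured" for downstream sinks.
fn add_field(record: &mut Record, key: &str, value: &str) {
    record.insert(key.to_string(), (value.to_string(), true));
}

fn main() {
    let mut record = Record::new();
    record.insert("message".to_string(), ("Hello world".to_string(), false));
    add_field(&mut record, "env", "prod");
    assert_eq!(record.get("env"), Some(&("prod".to_string(), true)));
}
```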

Sinks

  • cloudwatch_logs sink forwards the message field only if the record is entirely implicitly structured.
  • cloudwatch_logs sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • cloudwatch_logs sink maps the timestamp field to Cloudwatch's timestamp field and drops that field before encoding the data.
  • console sink prints the message field if the record is entirely implicitly structured.
  • console sink encodes the data to JSON if the record is not entirely implicitly structured. This payload should include all keys.
  • console sink provides an encoding option with json, text. If left unspecified, the behavior is dynamic, choosing the encoding on a per-record basis based on its explicitly structured state.
  • elasticsearch sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • http sink only includes the raw message if the record is entirely implicitly structured. This should be text/plain, new line delimited.
  • http sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured. This should be application/ndjson (new line delimited).
  • http sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
  • kafka sink only includes the raw message if the record is entirely implicitly structured.
  • kafka sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • kinesis sink only includes the raw message if the record is entirely implicitly structured.
  • kinesis sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • s3 sink only includes the raw message if the record is entirely implicitly structured.
  • s3 sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • splunk sink only includes the raw message if the record is entirely implicitly structured.
  • splunk sink maps the host field appropriately (should this also be dropped?)
  • splunk sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
  • tcp sink only includes the raw message if the record is entirely implicitly structured.
  • tcp sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
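Several sinks above share one predicate ("entirely implicitly structured") plus a per-record encoding choice. A sketch of the console sink's dynamic default (hypothetical names, assuming the (value, bool) map from the proposal):

```rust
use std::collections::HashMap;

// Hypothetical record shape from the proposal: key => (value, explicit).
type Record = HashMap<String, (String, bool)>;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Json,
    Text,
}

// An explicit `encoding` setting wins; otherwise pick per record based on
// whether any field was explicitly set.
fn choose_encoding(record: &Record, configured: Option<Encoding>) -> Encoding {
    configured.unwrap_or_else(|| {
        if record.values().all(|v| !v.1) {
            Encoding::Text // entirely implicit => raw message line
        } else {
            Encoding::Json // any explicit field => structured JSON
        }
    })
}

fn main() {
    let mut record = Record::new();
    record.insert("message".to_string(), ("Hello world".to_string(), false));
    assert_eq!(choose_encoding(&record, None), Encoding::Text);

    record.insert("msg".to_string(), ("Hello world".to_string(), true));
    assert_eq!(choose_encoding(&record, None), Encoding::Json);
    assert_eq!(choose_encoding(&record, Some(Encoding::Text)), Encoding::Text);
}
```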

Labels

  • domain: data model (anything related to Vector's internal data model)
  • domain: logs (anything related to Vector's log events)
  • type: enhancement (a value-adding code change that enhances existing functionality)
