Delineate internal metadata fields with a boolean identifier #295
Description
This is the successor to #256
There is a complex matrix of possible behaviors when you introduce structured vs non-structured data types across different topologies. This has been discussed at length:
- Determine whether data has gone through explicit structuring #256
- https://www.notion.so/timber/Raw-vs-Structured-Data-for-Sinks-f4a73d250d88427ca677ea0954e3ea75
- And multiple times in Slack
The purpose of this issue is to materialize the conclusion we came to today in Slack so that we can move forward with a number of other dependent issues, such as specifying the hostname key in various sinks, or specifying a target key when parsing a log message, etc.
Quick Background
- The issue was originally raised via [RFC] Drop the `host` & `line` keys in `Record` struct #155
- The "Structured vs Non-Structured" RFC was created to address this: https://www.notion.so/timber/Raw-vs-Structured-Data-for-Sinks-f4a73d250d88427ca677ea0954e3ea75
- This was implemented via Refactor Record and use bytes instead of String #204, which introduced a performance regression in the TCP source.
- Perf improvements #269 resolved this performance regression by moving `host` back to a defined `Record` field.
- This change caused a regression around how we handle the `host` key in the Splunk sink (or any sink that requires a host): Make splunk to use record host field #276
- Which then led to this comment: Make splunk to use record host field #276 (comment)
- Which then led to another long Slack discussion.
Solution
The solution today, which was proposed by @michaelfairley, is to delineate fields that were implicitly and explicitly set with a boolean. In other words, to change the structured map from string => string to string => (bool, string). This means we have one single map representing all structured data, with a simple boolean telling us if it was implicitly or explicitly set.
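To make the shape concrete, here is a minimal Python sketch of the single map described above. It is only an illustration, not the actual `Record` implementation: the `Record` alias and the helper names `is_unstructured` and `explicit_fields` are hypothetical, and the tuple order (value first, flag second) follows the examples later in this issue.

```python
from typing import Dict, Tuple

# Hypothetical representation: field name -> (value, explicitly_set).
# False means the field was implicitly set (for example by a source),
# True means it was explicitly set (for example by a parser transform).
Record = Dict[str, Tuple[str, bool]]


def is_unstructured(record: Record) -> bool:
    """Return True when every field in the record was implicitly set."""
    return all(not explicit for _, explicit in record.values())


def explicit_fields(record: Record) -> Dict[str, str]:
    """Return only the explicitly set fields, without their flags."""
    return {key: value for key, (value, explicit) in record.items() if explicit}


# "<timestamp>" is a placeholder string, as in the examples in this issue.
raw: Record = {
    "timestamp": ("<timestamp>", False),
    "message": ("Hello world", False),
    "host": ("my.host.com", False),
}

# The same record after something explicitly adds a field.
structured: Record = dict(raw)
structured["key"] = ("val", True)
```

With this shape, a sink can decide between "raw" and "structured" handling by looking at the flags alone, without a second map.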
Examples
TCP -> TCP
[sources.in]
type = "tcp"
# ...
[sinks.out]
inputs = ["in"]
type = "tcp"
# ...
- Input: `"Hello world"` raw text line.
- `tcp (in)` receives data and represents it as: `{ "timestamp" => (<timestamp>, false), "message" => ("Hello world", false), "host" => ("my.host.com", false) }`
  Where `false` means that the data was implicitly set.
- `tcp (out)` is written: `Hello world\n`
  Only the raw `"message"` field is written to the TCP sink. This is because the data structure is recognized as "unstructured" since all values are `false` (implicitly set). This is the default behavior for unstructured data in the TCP sink.
TCP -> JSON Parser -> TCP
[sources.in]
type = "tcp"
# ...
[transforms.json]
inputs = ["in"]
type = "parser"
format = "json"
# ...
[sinks.out]
inputs = ["json"]
type = "tcp"
encoder = "json" # required
# ...
- Input: `'{"msg": "Hello world", "key": "val"}'`
- `tcp (in)` receives data and represents it as: `{ "timestamp" => (<timestamp>, false), "message" => ('{"msg": "Hello world", "key": "val"}', false), "host" => ("my.host.com", false) }`
  Where `false` means that the data was implicitly set.
- `transform (json parser)` transforms the data into: `{ "timestamp" => (<timestamp>, false), "message" => ('{"msg": "Hello world", "key": "val"}', false), "host" => ("my.host.com", false), "msg" => ("Hello world", true), "key" => ("val", true) }`
- `tcp (out)` is written: `{"msg": "Hello world", "key": "val"}`
  You'll notice the `tcp.out` declaration includes a required `encoder` option since it is receiving structured data. This will be handled via Implement topology-aware validations #235. You'll also notice that metadata fields are not included by default. This is because these are internal/transparent fields that are only used when necessary or explicitly included (happy to hear arguments otherwise).
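The parse-then-encode flow in this example can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the real transform or sink: `parse_json` and `encode_json` are hypothetical names, the input is assumed to be a top-level JSON object, and the encoder follows the example output above by writing only explicitly set fields.

```python
import json
from typing import Dict, Tuple

# Hypothetical representation: field name -> (value, explicitly_set).
Record = Dict[str, Tuple[object, bool]]


def parse_json(record: Record, source_key: str = "message") -> Record:
    """Decode the JSON object stored under source_key.

    Existing fields keep their flags; every decoded field is added
    with the flag set to True (explicit).
    """
    raw, _ = record[source_key]
    decoded = json.loads(raw)  # assumed to be a JSON object
    out = dict(record)
    for key, value in decoded.items():
        out[key] = (value, True)
    return out


def encode_json(record: Record) -> str:
    """Serialize only the explicitly set fields, dropping internal ones."""
    return json.dumps(
        {key: value for key, (value, explicit) in record.items() if explicit}
    )


incoming: Record = {
    "timestamp": ("<timestamp>", False),
    "message": ('{"msg": "Hello world", "key": "val"}', False),
    "host": ("my.host.com", False),
}

parsed = parse_json(incoming)
print(encode_json(parsed))  # prints {"msg": "Hello world", "key": "val"}
```

The implicit `timestamp`, `message`, and `host` fields stay in the record after parsing, so a downstream sink that needs them (Splunk, for instance) can still read them.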
TCP -> Splunk
[sources.in]
type = "tcp"
# ...
[sinks.out]
inputs = ["in"]
type = "splunk"
- Input: `"Hello world"` raw text line.
- `tcp (in)` receives data and represents it as: `{ "timestamp" => (<timestamp>, false), "message" => ("Hello world", false), "host" => ("my.host.com", false) }`
  Where `false` means that the data was implicitly set.
- `splunk (out)` forwards the `"message"` but also specifies the `host` since Splunk requires this metadata. By default, the sink looks for the `"host"` key since this is one of our "common" keys, but the user can change that by setting the `host_field` setting in the `sinks.out` declaration.
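A small Python sketch of the host lookup described in the last step, assuming the `host_field` setting simply names the key to read. The function name `splunk_event` and the returned payload shape are illustrative assumptions, not the sink's actual wire format.

```python
from typing import Dict, Tuple

# Hypothetical representation: field name -> (value, explicitly_set).
Record = Dict[str, Tuple[str, bool]]


def splunk_event(record: Record, host_field: str = "host") -> Dict[str, str]:
    """Build an illustrative event: the raw message plus the host metadata.

    The host is read from host_field regardless of whether it was
    implicitly or explicitly set. If the key is absent, no host is attached.
    """
    event = {"event": record["message"][0]}
    host = record.get(host_field)
    if host is not None:
        event["host"] = host[0]
    return event


record: Record = {
    "timestamp": ("<timestamp>", False),
    "message": ("Hello world", False),
    "host": ("my.host.com", False),
}

default_event = splunk_event(record)
# A user who stores the host under a different key overrides host_field.
custom_event = splunk_event(
    {"message": ("Hello world", False), "hostname": ("other.host.com", True)},
    host_field="hostname",
)
```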
Requirements Checklist
This is a checklist to ensure we're handling all of the little details that come with this change. If it helps, we can break these out into separate issues, because I would assume they'll be separate PRs.
- Decide on reserved field names. Ex: `timestamp`, `message`, and `host`. Alternatively, we could namespace the keys like `_timestamp`, `_message`, and `_host`. Feel free to go with these or choose entirely different names; I'm indifferent.
Sources
- `file` source adds a `timestamp` field to represent when the record was received.
- `file` source includes the `host` key for the local server.
- `file` source includes a `file_key` config option to control the `"file"` context key name.
- `file` source includes a `host_key` config option to control the `"host"` context key name.
- The above `file` source behavior is tested.
- `syslog` source adds a `timestamp` field to represent when the record was received.
- `syslog` source includes the `host` key when in `tcp` mode; this represents the host of the client.
- `syslog` source includes the `host` key when in `unix` mode; this should be the local host.
- `syslog` source includes a `host_key` config option to control the `"host"` context key name.
- The above `syslog` source behavior is tested.
- `stdin` source adds a `timestamp` field to represent when the record was received.
- `stdin` source includes the `host` key for the local server.
- `stdin` source includes a `host_key` config option to control the `"host"` context key name.
- The above `stdin` source behavior is tested.
- `tcp` source adds a `timestamp` field to represent when the record was received.
- `tcp` source includes the `host` key for the remote server.
- `tcp` source includes a `host_key` config option to control the `"host"` context key name.
- The above `tcp` source behavior is tested.
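The source requirements above share one pattern: attach a receive timestamp, the message, and a host under a configurable key, all flagged as implicit. A hedged Python sketch of that pattern follows; `new_record` and its parameters are hypothetical, and the `now` argument exists only to keep the function easy to test.

```python
import socket
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

# Hypothetical representation: field name -> (value, explicitly_set).
Record = Dict[str, Tuple[str, bool]]


def new_record(
    line: str,
    host: str,
    host_key: str = "host",
    now: Optional[datetime] = None,
) -> Record:
    """Build the record a source would emit for one received line.

    Every field is flagged False because the source, not the user,
    decided to set it.
    """
    received_at = now if now is not None else datetime.now(timezone.utc)
    return {
        "timestamp": (received_at.isoformat(), False),
        "message": (line, False),
        host_key: (host, False),
    }


# file / stdin style: the host is the local machine.
local = new_record("Hello world", socket.gethostname())

# tcp style: the host is the remote peer, stored under a custom key.
remote = new_record("Hello world", "client.example.com", host_key="hostname")
```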
Transforms
- `add_field` transform sets any added fields as explicit.
- `json_parser` transform decodes data and sets all decoded fields as explicit.
- `regex_parser` transform sets any extracted fields as explicit.
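As an illustration of the `add_field` and `regex_parser` items, here is a Python sketch in which both transforms write their output with the flag set to `True`. The function signatures are assumptions made for the example; the real transforms are configured through the topology, not called like this.

```python
import re
from typing import Dict, Tuple

# Hypothetical representation: field name -> (value, explicitly_set).
Record = Dict[str, Tuple[str, bool]]


def add_field(record: Record, key: str, value: str) -> Record:
    """Return a copy of the record with one explicitly set field added."""
    out = dict(record)
    out[key] = (value, True)
    return out


def regex_parser(record: Record, pattern: str, source_key: str = "message") -> Record:
    """Extract named capture groups from source_key as explicit fields.

    If the pattern does not match, the record is returned unchanged.
    """
    out = dict(record)
    match = re.search(pattern, record[source_key][0])
    if match is not None:
        for key, value in match.groupdict().items():
            if value is not None:  # skip optional groups that did not match
                out[key] = (value, True)
    return out


record: Record = {"message": ("INFO: started", False)}
parsed = regex_parser(record, r"(?P<level>\w+): (?P<text>.*)")
tagged = add_field(parsed, "env", "production")
```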
Sinks
- `cloudwatch_logs` sink forwards the `message` field only if the record is entirely implicitly structured.
- `cloudwatch_logs` sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
- `cloudwatch_logs` sink maps the `timestamp` field to Cloudwatch's timestamp field and drops that field before encoding the data.
- `console` sink prints the `message` field if the record is entirely implicitly structured.
- `console` sink encodes the data to JSON if the record is not entirely implicitly structured. This payload should include all keys.
- `console` sink provides an `encoding` option with `json`, `text`. If left unspecified, the behavior is dynamic, choosing the encoding on a per-record basis based on its explicit structured state.
- `elasticsearch` sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
- `http` sink only includes the raw `message` if the record is entirely implicitly structured. This should be `text/plain`, newline delimited.
- `http` sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured. This should be `application/ndjson` (newline delimited).
- `http` sink encodes data (explicit and implicit keys) as JSON, regardless of whether the entire map is implicitly structured.
- `kafka` sink only includes the raw `message` if the record is entirely implicitly structured.
- `kafka` sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
- `kinesis` sink only includes the raw `message` if the record is entirely implicitly structured.
- `kinesis` sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
- `s3` sink only includes the raw `message` if the record is entirely implicitly structured.
- `s3` sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
- `splunk` sink only includes the raw `message` if the record is entirely implicitly structured.
- `splunk` sink maps the `host` field appropriately (should this also be dropped?)
- `kinesis` sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
- `tcp` sink only includes the raw `message` if the record is entirely implicitly structured.
- `tcp` sink encodes data (explicit and implicit keys) as JSON if the record is not entirely implicitly structured.
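The per-record choice described for the `console` sink can be sketched in Python as below. This is a sketch under assumptions: `encode_console` is a hypothetical name, an unset `encoding` is modeled as `None`, and the JSON branch follows the checklist wording by including all keys (explicit and implicit).

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical representation: field name -> (value, explicitly_set).
Record = Dict[str, Tuple[str, bool]]


def encode_console(record: Record, encoding: Optional[str] = None) -> str:
    """Encode one record as text or JSON.

    When encoding is None the choice is dynamic: text if every field
    was implicitly set, JSON otherwise.
    """
    if encoding is None:
        all_implicit = all(not explicit for _, explicit in record.values())
        encoding = "text" if all_implicit else "json"
    if encoding == "text":
        return record["message"][0]
    if encoding == "json":
        # Include every key, dropping only the boolean flags.
        return json.dumps({key: value for key, (value, _) in record.items()})
    raise ValueError(f"unsupported encoding: {encoding!r}")


raw: Record = {
    "timestamp": ("<timestamp>", False),
    "message": ("Hello world", False),
    "host": ("my.host.com", False),
}
structured: Record = dict(raw)
structured["key"] = ("val", True)

text_line = encode_console(raw)
json_line = encode_console(structured)
forced_json = encode_console(raw, encoding="json")
```

Setting `encoding` explicitly overrides the dynamic choice, so a fully implicit record can still be written as JSON when the user asks for it.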