Allow compactions to delete non-shared input files. #5387

@keith-turner

Description

Is your feature request related to a problem? Please describe.

Many files are referenced by only a single tablet, and a compaction could delete these files directly if it knew that no other tablet references them. Instead, a delete marker is always written for each input file and the GC has to process that marker.

Describe the solution you'd like

Each file in a tablet's metadata could have a shared marker that tracks whether more than one tablet references the file.

  • When a compaction creates a new file, it sets shared=false.
  • When a tablet splits, it sets shared=true on any files that go to multiple tablets.
  • When a table is cloned, it sets shared=true in the source table on any files that the new table references.
  • Bulk import could mark files as shared or not, depending on whether the files go to multiple tablets.
  • The FATE operation that commits a compaction could either delete the input files directly or write delete markers, depending on whether the files were shared.
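The marker lifecycle in the list above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the class `SharedMarkerSketch`, its methods, and the map that stands in for tablet metadata are all hypothetical and are not Accumulo APIs. Clone and bulk import are omitted; they would set the flag the same way a split does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory model of the proposed shared marker (not Accumulo code).
public class SharedMarkerSketch {
  // tablet name -> (file path -> shared flag); a stand-in for tablet metadata
  public final Map<String, Map<String, Boolean>> tablets = new HashMap<>();
  // files the compaction commit removed itself (no GC involvement)
  public final List<String> deletedDirectly = new ArrayList<>();
  // files left for the GC via a delete marker
  public final List<String> deleteMarkers = new ArrayList<>();

  // A file produced by a compaction is referenced by one tablet only.
  public void addCompactionOutput(String tablet, String file) {
    tablets.computeIfAbsent(tablet, t -> new TreeMap<>()).put(file, false);
  }

  // Split: files that end up in both children are marked shared=true.
  public void split(String parent, String left, Set<String> leftFiles,
                    String right, Set<String> rightFiles) {
    Map<String, Boolean> parentFiles = tablets.remove(parent);
    Map<String, Boolean> l = new TreeMap<>();
    Map<String, Boolean> r = new TreeMap<>();
    for (Map.Entry<String, Boolean> e : parentFiles.entrySet()) {
      String f = e.getKey();
      boolean inBoth = leftFiles.contains(f) && rightFiles.contains(f);
      // In this sketch a file that is already shared stays shared.
      boolean shared = e.getValue() || inBoth;
      if (leftFiles.contains(f)) l.put(f, shared);
      if (rightFiles.contains(f)) r.put(f, shared);
    }
    tablets.put(left, l);
    tablets.put(right, r);
  }

  // Commit: delete unshared inputs directly, write delete markers for shared ones.
  public void commitCompaction(String tablet, Set<String> inputs, String output) {
    Map<String, Boolean> files = tablets.get(tablet);
    if (files == null || !files.keySet().containsAll(inputs)) {
      throw new IllegalStateException("tablet does not reference all inputs");
    }
    for (String in : inputs) {
      boolean shared = files.remove(in);
      if (shared) {
        deleteMarkers.add(in);
      } else {
        deletedDirectly.add(in);
      }
    }
    files.put(output, false);
  }
}
```

For example, if tablet t1 holds A.rf and B.rf, and a split sends A.rf to one child and B.rf to both children, a later compaction of the first child would delete A.rf directly and write a delete marker for B.rf. In this sketch the flag never returns to false, so the other child keeps treating B.rf as shared even after it becomes the only reference; that is conservative rather than incorrect.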

For this feature to be possible, every one of the above operations must be safe to perform using conditional mutations.
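To illustrate why conditional updates matter here, the following sketch simulates a compare-and-set on a single tablet's file entries. It is a simplified stand-in, not Accumulo's ConditionalWriter API: the class `ConditionalUpdateSketch`, its `Status` enum, and its methods are hypothetical names chosen for this example. The point is that a compaction commit that read shared=false must have its removal rejected if a split or clone set shared=true in the meantime.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical compare-and-set model of one tablet's file entries (not Accumulo code).
public class ConditionalUpdateSketch {
  public enum Status { ACCEPTED, REJECTED }

  // file path -> shared flag; a stand-in for the file entries in one metadata row
  private final Map<String, Boolean> files = new HashMap<>();

  public synchronized void put(String file, boolean shared) {
    files.put(file, shared);
  }

  // Used by split or clone in this sketch: mark the file shared only if it is still referenced.
  public synchronized Status markShared(String file) {
    if (!files.containsKey(file)) {
      return Status.REJECTED;
    }
    files.put(file, true);
    return Status.ACCEPTED;
  }

  // Used by compaction commit in this sketch: remove the entry only if the
  // flag still has the value the caller read earlier.
  public synchronized Status removeIf(String file, boolean expectedShared) {
    Boolean current = files.get(file);
    if (current == null || current.booleanValue() != expectedShared) {
      return Status.REJECTED;
    }
    files.remove(file);
    return Status.ACCEPTED;
  }
}
```

With this model, a commit that observed shared=false and then calls `removeIf("A.rf", false)` after a concurrent `markShared("A.rf")` is rejected, so it would not go on to delete a file that another tablet now references. It would have to re-read the entry and fall back to writing a delete marker.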

The shared marker could be added to the per file metadata that is already stored in the tablet.

Describe alternatives you've considered

#2729 may be an alternative if HDFS supports hard links.

Additional context

This feature would reduce the work done by the Accumulo GC process and avoid storing delete markers for unshared files. The trade-off is that the new shared marker would have to be maintained, and compaction commit would, in some cases, make calls to the namenode to delete files.

Labels: enhancement (This issue describes a new feature, improvement, or optimization.)
