Skip to content

Create tserver.group, add group to location information#3482

Closed
dlmarion wants to merge 3 commits into
apache:elasticityfrom
dlmarion:3459-tserver-groups
Closed

Create tserver.group, add group to location information#3482
dlmarion wants to merge 3 commits into
apache:elasticityfrom
dlmarion:3459-tserver-groups

Conversation

@dlmarion
Copy link
Copy Markdown
Contributor

This change includes:

  1. Adding TSERV_GROUP_NAME property to add resource groups to TabletServers like has been done for Compactors and ScanServers
  2. Modified TServerInstance to include the group in the location information.
  3. Modified TabletMetadata to remove the session from the location columns colq and put all of the location information in the column value.
  4. Modified MetricsUtil to add the servers resource group as a tag to all emitted metrics
  5. Modified CachedTablet to use the TServerInstance instead of interned strings for the server and session.

1. Adding TSERV_GROUP_NAME property to add resource groups
   to TabletServers like has been done for Compactors and
   ScanServers
2. Modified TServerInstance to include the group in the
   location information.
3. Modified TabletMetadata to remove the session from the
   location columns colq and put all of the location
   information in the column value.
4. Modified MetricsUtil to add the servers resource group
   as a tag to all emitted metrics
5. Modified CachedTablet to use the TServerInstance instead
   of interned strings for the server and session.
@dlmarion dlmarion requested a review from keith-turner June 12, 2023 14:29
@dlmarion
Copy link
Copy Markdown
Contributor Author

This ended up being a bigger change than I expected. Follow-on work includes:

Rename compactor.queue
Update scripts for compactor.queue, addition of tserver.group
TabletServer resource group tests, to include balancers

mutation.put(getLocationFamily(location.getType()), location.getSession(),
location.getHostPort());
mutation.put(getLocationFamily(location.getType()), "",
location.getServerInstance().toString());
Copy link
Copy Markdown
Contributor Author

@dlmarion dlmarion Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be one of the biggest changes in this PR. It removes the sessionId from the ColQ of the location columns in the metadata and puts the <host:port>#<sessionId>#<resourceGroup> in the Value. I was wondering why the sessionId was in the colq to begin with, that seems like it would allow more than one entry for the location type in the metadata table.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Associated with this change, I ended up removing a test for https://issues.apache.org/jira/browse/ACCUMULO-1248 as it no longer applied - the test would not work.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if session id applies here, but my impression is that session id was used to detect client ZooKeeper re-connections. If a ZooKeeper session failed (did not respond with ZooKeeper tick time), but then established a "new" connection - the change in session id was detected to signal that there is a possibility that information in ZooKeeper could have changed while the session was "off-line". This may be a ZooCache centered view, but maybe it applied here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good information. I don't have the history to know why sessionId was in the column qualifier. It seemed like a bug to me as it would allow multiple current, future, or last entries.

Copy link
Copy Markdown
Contributor Author

@dlmarion dlmarion Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EdColeman - do you happen to know where the code is that detects a change in sessionId?

Edit: I think it's TabletMetadata.getTabletState(Set<TServerInstance>). That code should still work.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why the sessionId was in the colq to begin with, that seems like it would allow more than one entry for the location type in the metadata table.

I do not know of a specific reason why it was structured that way. It did allow detection of certain bugs like if the manager set a location on a tablet that already had a location, but I don't think that is why it was structured that way. Using conditional mutations can help avoid these bugs instead of detecting them after the fact. We could take this a step further and push the current vs future designation into the value and have a single column for location.

Copy link
Copy Markdown
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not finished looking at this. Plan to look at it some more tomorrow.

this.session = session;
this.hostPort = hostAndPort.toString();
this.hostPortSession = hostPort + "[" + session + "]";
this.group = group;
Copy link
Copy Markdown
Contributor

@keith-turner keith-turner Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected that group can be null? If so is null expected to make it through serialization and deserialization?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the default value is ServiceDescriptor..DEFAULT_GROUP_NAME.

Comment thread core/src/main/java/org/apache/accumulo/core/metadata/TServerInstance.java Outdated
return tServerInstance.getHost();
}

public String getHostPort() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these methods removed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they were redundant. The calling code can just call the method directly on the TServerInstance returned by getServerInstance().

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, personally I liked some of these. Saw them as convenience methods that made common patterns a bit shorter. But I don't care that much if they are removed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it makes the calling code shorter to read. However, I saw them as a potential source for a bug if for some reason someone changed them.

Comment thread core/src/main/java/org/apache/accumulo/core/metadata/TServerInstance.java Outdated
return 0;
}
return this.getHostPortSession().compareTo(other.getHostPortSession());
return this.getHostPortSessionGroup().compareTo(other.getHostPortSessionGroup());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected that the group would never change for an instance? If it did change it would be problematic to include it in compare, hashCode, and equals methods for tserver instance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The group would be passed to the process as an argument (-o tserver.group=foo), so it is expected to be static for the life of the instance. If the argument is not supplied, then the default value is used (ServiceDescriptor..DEFAULT_GROUP_NAME = default).

updateCache(locToCache, lcSession);
}
} else {
log.warn("Metadata tablet not found for row: {}", metadataRow);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case could happen when a metadata tablet is temporarily not hosted. Would not want to warn in that case.

Suggested change
log.warn("Metadata tablet not found for row: {}", metadataRow);
log.debug("Metadata tablet not found for row: {}", metadataRow);

"1.3.5"),
TABLE_ASSIGNMENT_GROUP("table.assignment.group", ServiceDescriptor.DEFAULT_GROUP_NAME,
PropertyType.STRING,
"Tablets for this table will be assigned to TabletServers that have a corresponding"
Copy link
Copy Markdown
Contributor

@keith-turner keith-turner Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement may not be true, it depends on the behavior of the balancer? Unless the balancer is limited to only assigning to tserver within the group?

We could limit balancing choices to a group, so a balancer could only chose from tservers within this group. That may require some changes to the SPI, not sure. Or this could be provided as information to the balancer, but it does not have to honor it. If the balancer does not have to honor it then would need to change the description.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the balancer is limited to only assigning to tserver within the group?

I think we should do this given that there is a default resource group and the ability to create others.

te.setLocationOnce(val, LocationType.FUTURE, suppressLocationError);
break;
case LastLocationColumnFamily.STR_NAME:
te.last = Location.last(val, qual);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of pushing the tservers group into the metadata table for each tablet?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be needed by the Manager to know whether or not a table's tablets are balanced in the correct group.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manager could get the group information about each tserver w/o it being stored in each tablet.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be correct, and it may already be done with the changes that I made to LiveTServerSet. I will investigate.

@dlmarion
Copy link
Copy Markdown
Contributor Author

Closing in favor of the implementation in #3496

@dlmarion dlmarion closed this Jun 26, 2023
@dlmarion dlmarion deleted the 3459-tserver-groups branch October 24, 2023 20:40
@ctubbsii ctubbsii added this to the 4.0.0 milestone Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

Resource Groups: Create TabletServer property like sserver.group that can be used for assignment / balancing

4 participants