@shwstppr
Contributor

@shwstppr shwstppr commented Aug 25, 2025

Description

This pull request refactors the TLS framing and buffer management in the Link class to improve correctness and maintainability, and updates the SSL context initialization to use TLS 1.3 for enhanced security. CloudStack prefixes each agent-server packet with a 4-byte length header. Previously, this header was not sent within the TLS application data. That hurt maintainability (simply switching to TLS 1.3 without changing the packet framing did not work and produced errors like [1]) and made it harder to implement agent-server communication in a different language. The most important changes are grouped below.

TLS Framing and Buffer Management

  • Reworked the TLS buffer handling in Link.java, replacing legacy header and packet assembly logic with a more robust system using netBuffer, appBuffer, and an explicit headerBuffer for frame length management. This improves frame parsing and avoids buffer overflows.
  • Refactored the read and write logic: the read method now correctly assembles frames from TLS streams, handling buffer resizing and edge cases, while the doWrite method builds TLS packets with a 4-byte length header and payload, ensuring correct framing and handshake handling.
  • Simplified the message sending and writing logic by removing manual header prepending and using the new framing system; the write queue now contains only payload buffers, and the header is added during the TLS wrap process.
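
As a rough illustration of the framing described above, the following sketch builds and parses a frame made of a 4-byte length header followed by the payload. It is not the code from Link.java: the class name `FrameSketch` and its methods are hypothetical, and the TLS wrap/unwrap step is left out, so the buffers here stand for plaintext application data only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of length-prefixed framing; not the actual Link.java code. */
final class FrameSketch {
    static final int HEADER_SIZE = 4; // 4-byte big-endian payload length

    /** Builds one frame: [4-byte length][payload]. In a TLS setup the whole frame would be wrapped. */
    static ByteBuffer encode(byte[] payload) {
        ByteBuffer frame = ByteBuffer.allocate(HEADER_SIZE + payload.length);
        frame.putInt(payload.length);
        frame.put(payload);
        frame.flip(); // switch to read mode
        return frame;
    }

    /**
     * Tries to extract one complete payload from decrypted bytes (buffer in read mode).
     * Returns null and restores the position when the frame is still incomplete.
     */
    static byte[] decode(ByteBuffer plain) {
        if (plain.remaining() < HEADER_SIZE) {
            return null;
        }
        plain.mark();
        int length = plain.getInt();
        if (length < 0) {
            throw new IllegalStateException("Negative frame length");
        }
        if (plain.remaining() < length) {
            plain.reset(); // wait for more data
            return null;
        }
        byte[] payload = new byte[length];
        plain.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        ByteBuffer frame = encode("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(frame.remaining()); // 4-byte header + 5-byte payload
        System.out.println(new String(decode(frame), StandardCharsets.UTF_8));
    }
}
```

Because the header travels inside the application data, a peer written in another language only needs a standard TLS library plus this length-prefix convention.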

Security Improvements

  • Updated SSL context initialization in Link.java to use SSLUtils.getSSLContextWithLatestVersion(), ensuring that TLS 1.3 is used for all server, client, and management SSL contexts.
  • Added a new method getSSLContextWithLatestVersion() in SSLUtils.java, which returns an SSLContext instance for TLS 1.3.
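
A minimal sketch of the idea, assuming a JDK with TLS 1.3 support (Java 11 or later): the context is requested by the `TLSv1.3` protocol name, as the description says. The class `Tls13ContextSketch` and the helper `tls13OnlyEngine` are hypothetical names for illustration. Depending on the JDK and its security configuration, a context obtained this way may still leave older protocol versions enabled, so the sketch also pins the enabled protocols on an engine; that pinning is an assumption and is not taken from this PR.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

/** Hypothetical illustration; not the actual SSLUtils.java code. */
final class Tls13ContextSketch {
    /** Requests an SSLContext by the "TLSv1.3" protocol name. */
    static SSLContext tls13Context() throws NoSuchAlgorithmException {
        return SSLContext.getInstance("TLSv1.3");
    }

    /** Assumption: restrict an engine to TLS 1.3 only, in case older versions are enabled by default. */
    static SSLEngine tls13OnlyEngine(SSLContext initializedContext) {
        SSLEngine engine = initializedContext.createSSLEngine();
        engine.setEnabledProtocols(new String[] {"TLSv1.3"});
        return engine;
    }

    public static void main(String[] args) throws Exception {
        SSLContext context = tls13Context();
        context.init(null, null, null); // default key and trust managers, for illustration only
        System.out.println(context.getProtocol());
        System.out.println(Arrays.toString(tls13OnlyEngine(context).getEnabledProtocols()));
    }
}
```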

[1] Error in the agent-server connection when using TLS 1.3 without the packet framing changes:

2025-08-25 18:41:41,698 INFO [utils.nio.NioClient] (main:[]) (logid:) Connecting to 172.120.0.67:8250
2025-08-25 18:41:41,702 INFO [utils.nio.NioClient] (main:[]) (logid:) Connected to 172.120.0.67:8250
2025-08-25 18:41:41,704 INFO [utils.nio.Link] (main:[]) (logid:) Conf file found: /etc/cloudstack/agent/agent.properties
2025-08-25 18:41:41,941 INFO [utils.nio.NioClient] (main:[]) (logid:) SSL: Handshake done
2025-08-25 18:41:41,950 DEBUG [utils.nio.NioClient] (Agent-NioConnectionHandler-1:[]) (logid:) Location 1: Socket Socket[addr=/172.120.0.67,port=8250,localport=59004] closed on read. Probably -1 returned: Input record too big: max = 16709 len = 22679
2025-08-25 18:41:41,950 DEBUG [utils.nio.NioClient] (Agent-NioConnectionHandler-1:[]) (logid:) Closing socket Socket[addr=/172.120.0.67,port=8250,localport=59004]

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Logs from management server:

[root@qa1-main-kvm-c0c69556-kvm-mgmt1 ~]# tail -f /var/log/cloudstack/management/management-server.log | grep SSL
2025-08-28 11:49:43,597 TRACE  [c.c.u.n.NioServer] (AgentManager-SSLHandshakeHandler-1:[]) (logid:) SSL: Handshake done with /172.120.0.188:34740 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384
2025-08-28 11:49:43,677 TRACE  [c.c.u.n.NioServer] (AgentManager-SSLHandshakeHandler-2:[]) (logid:) SSL: Handshake done with /172.120.0.156:37860 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384
2025-08-28 11:49:43,741 TRACE  [c.c.u.n.NioServer] (AgentManager-SSLHandshakeHandler-3:[]) (logid:) SSL: Handshake done with /172.120.1.143:44026 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384
2025-08-28 11:49:43,781 TRACE  [c.c.u.n.NioServer] (AgentManager-SSLHandshakeHandler-4:[]) (logid:) SSL: Handshake done with /172.120.1.227:36560 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384

Logs from one of the hosts:

[root@qa1-main-kvm-c0c69556-kvm-host1 ~]# tail -f /var/log/cloudstack/agent/agent.log | grep SSL
2025-08-28 11:49:43,673 INFO  [utils.nio.NioClient] (Agent-Handler-3:[]) (logid:) SSL: Handshake done with /172.120.0.67:8250 protocol: TLSv1.3, cipher suite: TLS_AES_256_GCM_SHA384

Communication with hosts, system VMs, and the management server (MS) seemed fine.

How did you try to break this feature and the system with this change?

Signed-off-by: Abhishek Kumar <[email protected]>
@codecov

codecov bot commented Aug 25, 2025

Codecov Report

❌ Patch coverage is 61.97183% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.76%. Comparing base (c465caf) to head (f847d10).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| utils/src/main/java/com/cloud/utils/nio/Link.java | 59.70% | 38 Missing and 16 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11503      +/-   ##
============================================
- Coverage     17.76%   17.76%   -0.01%     
+ Complexity    15859    15858       -1     
============================================
  Files          5923     5923              
  Lines        530470   530495      +25     
  Branches      64823    64825       +2     
============================================
- Hits          94243    94239       -4     
- Misses       425682   425712      +30     
+ Partials      10545    10544       -1     
| Flag | Coverage Δ |
| --- | --- |
| uitests | 3.57% <ø> (ø) |
| unittests | 18.85% <61.97%> (-0.01%) ⬇️ |


@shwstppr
Contributor Author

@blueorangutan package

@blueorangutan

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14723

Signed-off-by: Abhishek Kumar <[email protected]>
@apache apache deleted a comment from shwstppr Aug 29, 2025
@apache apache deleted a comment from blueorangutan Aug 29, 2025
@apache apache deleted a comment from blueorangutan Aug 29, 2025
@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@shwstppr
Contributor Author

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-14133)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 418823 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11503-t14133-kvm-ol8.zip
Smoke tests completed. 135 look OK, 11 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestClusterDRS>:setup | Error | 0.00 | test_cluster_drs.py |
| test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 | Error | 6208.17 | test_internal_lb.py |
| ContextSuite context=TestIpv4Routing>:setup | Error | 0.00 | test_ipv4_routing.py |
| test_01_create_iso_with_checksum_sha1 | Error | 66.53 | test_iso.py |
| test_03_create_iso_with_checksum_md5 | Error | 66.52 | test_iso.py |
| test_list_system_vms_metrics_history | Failure | 0.25 | test_metrics_api.py |
| test_list_vms_metrics_admin | Error | 3605.09 | test_metrics_api.py |
| test_list_vms_metrics_history | Error | 5.54 | test_metrics_api.py |
| test_01_vpn_usage | Error | 1.10 | test_usage.py |
| test_01_scale_up_verify | Failure | 576.75 | test_vm_autoscaling.py |
| test_02_update_vmprofile_and_vmgroup | Failure | 370.82 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Failure | 734.97 | test_vm_autoscaling.py |
| test_07_autoscaling_vmgroup_on_vpc_network | Error | 734.99 | test_vm_autoscaling.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Failure | 6953.64 | test_vpc_redundant.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Error | 6954.10 | test_vpc_redundant.py |
| test_02_redundant_VPC_default_routes | Failure | 8064.42 | test_vpc_redundant.py |
| test_02_redundant_VPC_default_routes | Error | 8064.96 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Failure | 8298.88 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Error | 8299.49 | test_vpc_redundant.py |
| test_04_rvpc_network_garbage_collector_nics | Failure | 8559.59 | test_vpc_redundant.py |
| test_04_rvpc_network_garbage_collector_nics | Error | 8560.14 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Failure | 9811.76 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Error | 9812.64 | test_vpc_redundant.py |
| test_01_VPC_nics_after_destroy | Failure | 4954.75 | test_vpc_router_nics.py |
| test_02_VPC_default_routes | Failure | 5398.62 | test_vpc_router_nics.py |
| test_01_redundant_vpc_site2site_vpn | Failure | 8479.78 | test_vpc_vpn.py |
| test_01_redundant_vpc_site2site_vpn | Error | 8480.33 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn_multiple_options | Failure | 5471.38 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn_multiple_options | Error | 5471.82 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn | Failure | 5990.11 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn | Error | 5990.47 | test_vpc_vpn.py |
| test_hostha_enable_ha_when_host_in_maintenance | Error | 305.97 | test_hostha_kvm.py |

@blueorangutan

[LL] Trillian Build Failed (tid-7129)

@shwstppr
Contributor Author

shwstppr commented Nov 9, 2025

@blueorangutan package

@blueorangutan

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15690

@shwstppr
Contributor Author

shwstppr commented Nov 9, 2025

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@shwstppr
Contributor Author

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@shwstppr
Contributor Author

There are some issues with the smoke test runs. I will investigate and make the required fixes.

Contributor

Copilot AI left a comment

Pull request overview

This PR upgrades the CloudStack agent-server communication from TLS 1.2 to TLS 1.3 and refactors packet framing to include the 4-byte length header within TLS application data rather than outside it. This improves maintainability and enables proper TLS 1.3 operation. The changes include significant rewrites to buffer management in the Link class and updates to SSL context initialization.

Key Changes:

  • Introduced getSSLContextWithLatestProtocolVersion() method to support TLS 1.3 contexts
  • Refactored Link class buffer management to use explicit headerBuffer, netBuffer, and appBuffer with frame-aware reading logic
  • Modified doWrite to prepend the 4-byte frame header within TLS application data before encryption

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| utils/src/main/java/org/apache/cloudstack/utils/security/SSLUtils.java | Adds new method to create TLS 1.3 SSL contexts |
| utils/src/main/java/com/cloud/utils/nio/NioConnection.java | Enhances SSL handshake logging with protocol and cipher suite details |
| utils/src/main/java/com/cloud/utils/nio/NioClient.java | Enhances SSL handshake logging with protocol and cipher suite details |
| utils/src/main/java/com/cloud/utils/nio/Link.java | Major refactor: updates buffer management, frame parsing, read/write logic, and switches all SSL contexts to TLS 1.3 |
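
The enhanced handshake logging mentioned for NioConnection.java and NioClient.java could look roughly like the sketch below. It only assumes the standard `SSLSession.getProtocol()` and `SSLSession.getCipherSuite()` accessors; the class `HandshakeLogSketch` and the `describe` helpers are hypothetical, and the message format is modeled on the log lines quoted in this PR, not copied from the changed code.

```java
import javax.net.ssl.SSLSession;

/** Hypothetical illustration of handshake logging; not the actual NioClient/NioConnection code. */
final class HandshakeLogSketch {
    /** Formats a log message from already extracted values. */
    static String describe(String peer, String protocol, String cipherSuite) {
        return String.format("SSL: Handshake done with %s protocol: %s, cipher suite: %s",
                peer, protocol, cipherSuite);
    }

    /** Reads the negotiated protocol and cipher suite from the session of a completed handshake. */
    static String describe(String peer, SSLSession session) {
        return describe(peer, session.getProtocol(), session.getCipherSuite());
    }

    public static void main(String[] args) {
        System.out.println(describe("/172.120.0.67:8250", "TLSv1.3", "TLS_AES_256_GCM_SHA384"));
    }
}
```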


appBuffer.compact();
break;
case BUFFER_OVERFLOW:
appBuffer = enlargeBuffer(appBuffer, _sslEngine.getSession().getApplicationBufferSize());

Copilot AI Jan 7, 2026

The BUFFER_OVERFLOW case allocates a new buffer without preserving data from appBuffer. When overflow occurs during unwrap, appBuffer may contain unconsumed application data from a previous compact operation (line 261). The enlargeBuffer call discards this data, corrupting the frame assembly. The data should be preserved by flipping appBuffer, copying to the new buffer, and then compacting.

Suggested change
appBuffer = enlargeBuffer(appBuffer, _sslEngine.getSession().getApplicationBufferSize());
// Preserve existing data in appBuffer while enlarging it for the next unwrap
appBuffer.flip();
int requiredCapacity = _sslEngine.getSession().getApplicationBufferSize() + appBuffer.remaining();
ByteBuffer newAppBuffer = ByteBuffer.allocate(requiredCapacity);
newAppBuffer.put(appBuffer);
appBuffer = newAppBuffer;
appBuffer.compact();
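
For reference, a data-preserving enlargement can be sketched as below. This is an illustration under assumptions, not the `enlargeBuffer` helper from Link.java: the class `EnlargeSketch` and the method `enlargePreserving` are hypothetical names, and the source buffer is assumed to be in write mode (position at the end of the buffered data).

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of growing a buffer without losing buffered bytes. */
final class EnlargeSketch {
    /**
     * Returns a larger buffer holding the bytes already written to src.
     * src is expected in write mode; the result is in write mode as well.
     */
    static ByteBuffer enlargePreserving(ByteBuffer src, int extraCapacity) {
        src.flip(); // expose the buffered bytes for reading
        ByteBuffer bigger = ByteBuffer.allocate(src.remaining() + extraCapacity);
        bigger.put(src); // copy the existing data
        return bigger;   // position == number of preserved bytes, ready for more writes
    }

    public static void main(String[] args) {
        ByteBuffer small = ByteBuffer.allocate(4);
        small.put((byte) 1).put((byte) 2).put((byte) 3);
        ByteBuffer bigger = enlargePreserving(small, 8);
        System.out.println(bigger.position() + " of " + bigger.capacity());
    }
}
```

Note that this sketch does not call `compact()` after the copy. After `put(src)` the new buffer is already in write mode, and `compact()` at that point would move the unwritten region to the front and overwrite the bytes that were just copied.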

Comment on lines +73 to +75
public static SSLContext getSSLContextWithLatestProtocolVersion() throws NoSuchAlgorithmException {
return SSLContext.getInstance("TLSv1.3");
}

Copilot AI Jan 7, 2026

The method name in the PR description says 'getSSLContextWithLatestVersion()' but the actual implementation uses 'getSSLContextWithLatestProtocolVersion()'. The description should be updated to match the implementation, or vice versa.

dataRemaining -= count;
if (res.getHandshakeStatus() == HandshakeStatus.NEED_UNWRAP) {
// Unusual during application writes; upper layer should drive handshake
break;

Copilot AI Jan 7, 2026

When NEED_UNWRAP occurs during write, the method breaks out of the loop without completing the write operation. This leaves some data in appSeq unsent, but the caller assumes the write succeeded. During renegotiation, NEED_UNWRAP can occur during wrap, and the handshake must be driven to completion before continuing the application data write. Consider throwing an exception or implementing handshake handling to ensure data is not silently dropped.

Suggested change
break;
throw new IOException("SSLEngine reported NEED_UNWRAP during write; renegotiation during write is not supported and application data may remain unsent");

public static SSLContext getSSLContext() throws NoSuchAlgorithmException {
return SSLContext.getInstance("TLSv1.2");
}


Copilot AI Jan 7, 2026

The new method getSSLContextWithLatestProtocolVersion lacks documentation. Add a Javadoc comment explaining that it returns an SSLContext configured for TLS 1.3, and clarify when this should be used versus the TLS 1.2 method to help maintainers understand the distinction.

Suggested change
/**
* Returns an {@link SSLContext} configured for the latest supported TLS protocol version (currently TLSv1.3).
* <p>
* Use this method when both client and server support TLSv1.3 and you want to take advantage of the
* latest protocol improvements. For environments that require or are limited to TLSv1.2, use
* {@link #getSSLContext()} instead.
*
* @return an {@code SSLContext} instance using the {@code TLSv1.3} protocol
* @throws NoSuchAlgorithmException if the TLSv1.3 protocol is not available in the current environment
*/

Comment on lines 129 to +130
int totalLen = 0;
for (ByteBuffer buffer : buffers) {
totalLen += buffer.limit();
}

int processedLen = 0;
while (processedLen < totalLen) {
headBuf.clear();
pkgBuf.clear();
engResult = sslEngine.wrap(buffers, pkgBuf);
if (engResult.getHandshakeStatus() != HandshakeStatus.FINISHED && engResult.getHandshakeStatus() != HandshakeStatus.NOT_HANDSHAKING &&
engResult.getStatus() != SSLEngineResult.Status.OK) {
throw new IOException("SSL: SSLEngine return bad result! " + engResult);
for (ByteBuffer b : buffers) totalLen += b.remaining();

Copilot AI Jan 7, 2026

The totalLen calculation could overflow if the sum of buffer.remaining() across all buffers exceeds Integer.MAX_VALUE. While unlikely in practice (would require >2GB of buffered data), an overflow would result in a corrupted negative frame length being written to the header. Consider validating totalLen is positive after the loop, or using long arithmetic with a range check.
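
One way to apply the "long arithmetic with a range check" option is sketched below. The class `TotalLenSketch` and the method `totalRemaining` are hypothetical names, and the choice of `IllegalArgumentException` is an assumption made for the example.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of an overflow-safe length sum. */
final class TotalLenSketch {
    /** Sums remaining() across all buffers in a long and rejects totals that do not fit in an int. */
    static int totalRemaining(ByteBuffer[] buffers) {
        long total = 0;
        for (ByteBuffer buffer : buffers) {
            total += buffer.remaining();
        }
        if (total > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("Total frame length " + total + " exceeds Integer.MAX_VALUE");
        }
        return (int) total;
    }

    public static void main(String[] args) {
        ByteBuffer[] buffers = {ByteBuffer.allocate(10), ByteBuffer.allocate(6)};
        System.out.println(totalRemaining(buffers));
    }
}
```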

Comment on lines +228 to +232
frameRemaining = headerBuffer.getInt();
headerBuffer.clear();
if (frameRemaining < 0) {
throw new IOException("Negative frame length");
}

Copilot AI Jan 7, 2026

The frame length is read as a signed integer without an upper bound check. A malicious or corrupted header could specify a very large frame length (e.g., Integer.MAX_VALUE = ~2GB), causing an OutOfMemoryError at line 234 when allocating plainTextBuffer. Add a maximum frame size check (e.g., 16MB) to prevent resource exhaustion attacks.
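
A bounded check along these lines might look like the following sketch. The 16 MB cap comes from the example figure in this comment and is not a value defined by the PR; the class `FrameLengthSketch`, the constant `MAX_FRAME_SIZE`, and the method `readFrameLength` are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical illustration of validating a frame length before allocating a buffer for it. */
final class FrameLengthSketch {
    static final int MAX_FRAME_SIZE = 16 * 1024 * 1024; // assumed cap, for illustration only

    /** Reads the 4-byte length from a header buffer in read mode and rejects out-of-range values. */
    static int readFrameLength(ByteBuffer header) throws IOException {
        int length = header.getInt();
        if (length < 0) {
            throw new IOException("Negative frame length");
        }
        if (length > MAX_FRAME_SIZE) {
            throw new IOException("Frame length " + length + " exceeds limit " + MAX_FRAME_SIZE);
        }
        return length;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(4);
        header.putInt(1024);
        header.flip();
        System.out.println(readFrameLength(header));
    }
}
```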

@shwstppr shwstppr added this to the 4.23.0 milestone Jan 7, 2026