Type:
issue
Question/Problem:
Upgrading Teleport from a pre-6.2.x version to 6.2.x or later caused slow performance and cluster access issues for a few hours.
Symptoms:
Users may notice connectivity issues between clients and resources, and admins may observe agents (k8s, db, ssh, etc.) having trouble connecting to, or staying connected to, the cluster.
Additionally, admins may observe a significant spike in reads/writes on the DynamoDB table that coincides with the upgrade. As a result, some DynamoDB capacity limits may be hit temporarily, which throttles reads/writes.
Logs:
2021-10-14T14:00:00.000+0000 teleportauth [kern.err] /usr/local/bin/teleport[27924]: 2021-09-10T14:59:59Z ERRO [AUTH:1] "Failed to retrieve client pool. Client cluster teleport-example, target cluster teleport-example, error:
ERROR REPORT:
Original Error: *trace.NotFoundError "/authorities/host/pcloud-teleport" is not found
Stack Trace:
/go/src/github.com/gravitational/teleport/lib/backend/dynamo/dynamodbbk.go:857 github.com/gravitational/teleport/lib/backend/dynamo
(*Backend).getKey\n\t/go/src/github.com/gravitational/teleport/lib/backend/dynamo/dynamodbbk.go:453 github.com/gravitational/teleport/lib/backend/dynamo.
(*Backend).Get\n\t/go/src/github.com/gravitational/teleport/lib/backend/sanitize.go:97 github.com/gravitational/teleport/lib/backend.
(*Sanitizer).Get\n\t/go/src/github.com/gravitational/teleport/lib/backend/report.go:161 github.com/gravitational/teleport/lib/backend.
(*Reporter).Get\n\t/go/src/github.com/gravitational/teleport/lib/services/local/trust.go:207 github.com/gravitational/teleport/lib/services/local.
(*CA).GetCertAuthority\n\t/go/src/github.com/gravitational/teleport/lib/cache/cache.go:977 github.com/gravitational/teleport/lib/cache.
(*Cache).GetCertAuthority\n\t/go/src/github.com/gravitational/teleport/lib/auth/middleware.go:571 github.com/gravitational/teleport/lib/auth.ClientCertPool\n\t/go/src/github.com/gravitational/teleport/lib/auth/middleware.go:265 github.com/gravitational/teleport/lib/auth.
(*TLSServer).GetConfigForClient\n\t/opt/go/src/crypto/tls/handshake_server.go:141 crypto/tls.
(*Conn).readClientHello\n\t/opt/go/src/crypto/tls/handshake_server.go:40 crypto/tls.
(*Conn).serverHandshake\n\t/opt/go/src/crypto/tls/conn.go:1362 crypto/tls.
(*Conn).Handshake\n\t/go/src/github.com/gravitational/teleport/lib/multiplexer/tls.go:144 github.com/gravitational/teleport/lib/multiplexer.
(*TLSListener).detectAndForward\n\t/opt/go/src/runtime/asm_amd64.s:1374 runtime.goexit\nUser Message: "/authorities/host/pcloud-teleport" is not found." auth/middleware.go:271
Repro Steps:
Upgrade the auth server from a pre-6.2.x version of Teleport to version 6.2.x or later.
Solution:
Starting with 6.2, the events backend indexing strategy changed, and a data migration is triggered after the upgrade. For optimal performance, it is recommended that this migration be performed with only one auth server online. The migration may take some time and, depending on the size of the cluster and the number of events previously logged, may cause a significant temporary increase in DynamoDB reads and writes.
This increase in reads and writes, when combined with hard capacity limits on the DynamoDB tables, can create a choke point: the auth server may be temporarily prevented from accessing the backend properly, which impacts connectivity between users, the cluster, and agent resources.
The recommended solution is to significantly increase read/write limits during the migration process and perform the migration and upgrade with only one auth server online.
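Example:
As a rough illustration of raising the limits, the sketch below builds the request for the DynamoDB UpdateTable API and, optionally, sends it with boto3. It assumes the table uses provisioned capacity and is not managed by auto scaling. The table name "teleport-events" and the capacity values are hypothetical placeholders, not recommendations; choose values based on your own cluster size and event volume, and restore the original limits once the migration has finished.

```python
"""Sketch: temporarily raise provisioned capacity on a DynamoDB table."""


def build_capacity_update(table_name, read_units, write_units):
    """Return keyword arguments for the DynamoDB UpdateTable call."""
    if read_units < 1 or write_units < 1:
        raise ValueError("capacity units must be positive integers")
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


def apply_capacity_update(params):
    """Send the update to AWS. Requires boto3 and valid AWS credentials."""
    # boto3 is a third-party package; it is imported here so that the
    # helper above can be used and checked without it installed.
    import boto3

    client = boto3.client("dynamodb")
    return client.update_table(**params)


if __name__ == "__main__":
    # Hypothetical table name and capacity values; substitute your own.
    params = build_capacity_update("teleport-events", 500, 500)
    print(params)
    # apply_capacity_update(params)  # uncomment to apply against AWS
```

The request is built separately from the call that sends it so the values can be reviewed before anything is changed in AWS.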