Solution: Upgrade the cluster to 2.2.16 and have the application batch-delete and batch-write the data.
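A minimal sketch of such a batch clean-up job with pymilvus is shown below; the collection name, primary-key field, schema, and batch size are assumptions for illustration, not values taken from the original cluster.

# Hedged sketch: delete stale entities and re-insert fresh rows in small batches.
# The collection "docs", the "id" primary key, the column layout, and BATCH are
# assumed values for illustration only.
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("docs")

BATCH = 1000

def delete_in_batches(ids):
    # Delete by primary key in chunks instead of one huge boolean expression.
    for i in range(0, len(ids), BATCH):
        chunk = ids[i:i + BATCH]
        collection.delete(expr=f"id in {chunk}")

def insert_in_batches(columns):
    # columns: parallel lists (e.g. [ids, texts, extras, vectors]) matching the schema.
    for i in range(0, len(columns[0]), BATCH):
        collection.insert([col[i:i + BATCH] for col in columns])
    collection.flush()  # make the batched writes durable and visible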
※ find no available rootcoord, check rootcoord state

Error:

[2024/09/26 08:19:14.956 +00:00] [ERROR] [grpcclient/client.go:158] ["failed to get client address"] [error="find no available rootcoord, check rootcoord state"] [stack="
github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).connect
    /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:158
github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).GetGrpcClient
    /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:131
github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).callOnce
    /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:256
github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
    /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:312
github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).GetComponentStates
    /go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/client/client.go:129
github.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates.func1
    /go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:65
github.com/milvus-io/milvus/internal/util/retry.Do
    /go/src/github.com/milvus-io/milvus/internal/util/retry/retry.go:42
github.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates
    /go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:89
github.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentHealthy
    /go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:104
github.com/milvus-io/milvus/internal/distributed/datanode.(*Server).init
    /go/src/github.com/milvus-io/milvus/internal/distributed/datanode/service.go:275
github.com/milvus-io/milvus/internal/distributed/datanode.(*Server).Run
    /go/src/github.com/milvus-io/milvus/internal/distributed/datanode/service.go:172
github.com/milvus-io/milvus/cmd/components.(*DataNode).Run
    /go/src/github.com/milvus-io/milvus/cmd/components/data_node.go:51
github.com/milvus-io/milvus/cmd/roles.runComponent[...].func1
    /go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:102"]
Problem: Communication between rootcoord and the other pods broke down.

Solution: Rebuild the rootcoord pod first, then rebuild the related querynode, indexnode, querycoord, and indexcoord pods in turn.
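A minimal sketch of that restart order with the official Kubernetes Python client is shown below; the namespace and the component label values follow common Milvus Helm chart conventions and are assumptions here, not values from the original cluster.

# Hedged sketch: delete Milvus pods in dependency order and let their Deployments
# recreate them. Namespace and label selectors are assumed (Milvus Helm chart style).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "milvus"
RESTART_ORDER = ["rootcoord", "querynode", "indexnode", "querycoord", "indexcoord"]

for component in RESTART_ORDER:
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=f"component={component}")
    for pod in pods.items:
        # Deleting the pod forces its Deployment to schedule a fresh replica.
        v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)

In practice you would also wait for each component to report healthy before restarting the next one.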
※ Searches from the UI return an error

(Search 372 failed, reason Timestamp lag too large)
[2024/09/26 09:14:13.063 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)"]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/task_search.go:529] ["QueryNode search result error"] [traceID=62505beaa974c903] [msgID=452812354979102723] [nodeID=372] [reason="Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)"]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/task_policies.go:132] ["failed to do query with node"] [traceID=62505beaa974c903] [nodeID=372] [dmlChannels="[by-dev-rootcoord-dml_6_442659379752037218v0,by-dev-rootcoord-dml_7_442659379752037218v1]"] [error="code: UnexpectedError, error: fail to Search, QueryNode ID=372, reason=Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)"]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/task_policies.go:159] ["retry another query node with round robin"] [traceID=62505beaa974c903] [Nexts="{"by-dev-rootcoord-dml_6_442659379752037218v0":-1,"by-dev-rootcoord-dml_7_442659379752037218v1":-1}"]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/task_policies.go:60] ["no shard leaders were available"] [traceID=62505beaa974c903] [channel=by-dev-rootcoord-dml_6_442659379752037218v0] [leaders="[<NodeID: 372>]"]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/task_policies.go:119] ["failed to search/query with round-robin policy"] [traceID=62505beaa974c903] [error="Channel: by-dev-rootcoord-dml_7_442659379752037218v1 returns err: code: UnexpectedError, error: fail to Search, QueryNode ID=372, reason=Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>); Channel: by-dev-rootcoord-dml_6_442659379752037218v0 returns err: code: UnexpectedError, error: fail to Search, QueryNode ID=372, reason=Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)"]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/task_search.go:412] ["failed to do search"] [traceID=62505beaa974c903] [Shards="map[by-dev-rootcoord-dml_6_442659379752037218v0:[<NodeID: 372>] by-dev-rootcoord-dml_7_442659379752037218v1:[<NodeID: 372>]]"] [error="code: UnexpectedError, error: fail to Search, QueryNode ID=372, reason=Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)"]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/task_search.go:425] ["first search failed, updating shardleader caches and retry search"] [traceID=62505beaa974c903] [error="code: UnexpectedError, error: fail to Search, QueryNode ID=372, reason=Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)"]
[2024/09/26 09:14:13.063 +00:00] [INFO] [proxy/meta_cache.go:767] ["clearing shard cache for collection"] [collectionName=xxx]
[2024/09/26 09:14:13.063 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="code: UnexpectedError, error: fail to Search, QueryNode ID=372, reason=Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)"]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/task_scheduler.go:473] ["Failed to execute task: "] [error="fail to search on all shard leaders, err=All attempts results:\nattempt #1:code: UnexpectedError, error: fail to Search, QueryNode ID=372, reason=Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)\nattempt #2:context canceled\n"] [traceID=62505beaa974c903]
[2024/09/26 09:14:13.063 +00:00] [WARN] [proxy/impl.go:2861] ["Search failed to WaitToFinish"] [traceID=62505beaa974c903] [error="fail to search on all shard leaders, err=All attempts results:\nattempt #1:code: UnexpectedError, error: fail to Search, QueryNode ID=372, reason=Search 372 failed, reason Timestamp lag too large lag(28h44m48.341s) max(24h0m0s) err %!w(<nil>)\nattempt #2:context canceled\n"] [role=proxy] [msgID=452812354979102723] [db=] [collection=xxx] [partitions="[]"] [dsl=] [len(PlaceholderGroup)=4108] [OutputFields="[id,text,extra]"] [search_params="[{"key":"params","value":"{\\"ef\\":250}"},{"key":"anns_field","value":"vector"},{"key":"topk","value":"100"},{"key":"metric_type","value":"L2"},{"key":"round_decimal","value":"-1"}]"] [travel_timestamp=0] [guarantee_timestamp=0]
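For context, the failing UI search corresponds to a client call like the following pymilvus sketch; the anns_field, topk, metric type, ef, and output fields are taken from the [search_params] entry in the log above, while the host, the anonymized collection name, and the query vector are placeholders.

# Hedged sketch: reconstruct the search that produced the log entries above.
# Search parameters come from the log; host, vector dimension, and the
# anonymized collection name "xxx" are placeholders.
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("xxx")

query_vectors = [[0.0] * 768]  # placeholder; the real vector dimension is not in the log

results = collection.search(
    data=query_vectors,
    anns_field="vector",
    param={"metric_type": "L2", "params": {"ef": 250}},
    limit=100,  # topk=100 in the log
    output_fields=["id", "text", "extra"],
)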