A service crashes once the volume of produced data grows

Blade · Unresolved · 1 reply · 1027 views
wangjie (剑圣) 2022-05-11 17:19

I. What are the steps to reproduce the issue?

1. 

==============  Sql Start  ==============

Execute SQL : select id, id_hex from factory_secondtest_data_SSD_1 where is_deleted = 0 and id_hex in ('D81897') and id_hex != '000000' and time >= 1644564004000 and 1652253604488 >= time

Execute Time: 0.367ms

==============  Sql  End   ==============


2022-05-11 15:31:13.130  INFO [blade-iot,4c8401370367262c,4c8401370367262c,true] 13611 --- [  XNIO-1 task-1] o.s.core.mp.plugins.SqlLogInterceptor    : 


==============  Sql Start  ==============

Execute SQL : select id, id_hex from factory_secondtest_data_SSD_1 where is_deleted = 0 and id_hex in ('D82D03') and id_hex != '000000' and time >= 1644564564000 and 1652254164485 >= time

Execute Time: 0.455ms

==============  Sql  End   ==============


2022-05-11 15:31:13.131  INFO [blade-iot,1307c52cb2f7cf8f,1307c52cb2f7cf8f,false] 13611 --- [ XNIO-1 task-12] o.s.core.mp.plugins.SqlLogInterceptor    : 


==============  Sql Start  ==============

Execute SQL : select id, id_hex from factory_secondtest_data_SSD_1 where is_deleted = 0 and id_hex in ('D82D22') and id_hex != '000000' and time >= 1644564440000 and 1652254040042 >= time

Execute Time: 0.296ms

==============  Sql  End   ==============


2022-05-11 15:31:15.663 ERROR [blade-iot,,,] 13611 --- [t.remote.worker] c.a.n.c.remote.client.grpc.GrpcClient    : Server check fail, please check server 47.134.186.192 ,port 9848 is available , error ={}


java.util.concurrent.TimeoutException: Waited 3000 milliseconds (plus 4260 milliseconds, 423900 nanoseconds delay) for com.alibaba.nacos.shaded.io.grpc.stub.ClientCalls$GrpcFuture@477b491f[status=PENDING, info=[GrpcFuture{clientCall={delegate={delegate=ClientCallImpl{method=MethodDescriptor{fullMethodName=Request/request, type=UNARY, idempotent=false, safe=false, sampledToLocalTracing=true, requestMarshaller=com.alibaba.nacos.shaded.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@583c8ed6, responseMarshaller=com.alibaba.nacos.shaded.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@5faeb374, schemaDescriptor=com.alibaba.nacos.api.grpc.auto.RequestGrpc$RequestMethodDescriptorSupplier@652d09c7}}}}}]]

    at com.alibaba.nacos.shaded.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:508)
    at com.alibaba.nacos.common.remote.client.grpc.GrpcClient.serverCheck(GrpcClient.java:146)
    at com.alibaba.nacos.common.remote.client.grpc.GrpcClient.connectToServer(GrpcClient.java:268)
    at com.alibaba.nacos.common.remote.client.RpcClient.reconnect(RpcClient.java:529)
    at com.alibaba.nacos.common.remote.client.RpcClient$3.run(RpcClient.java:374)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


2022-05-11 15:31:15.663 ERROR [blade-iot,,,] 13611 --- [t.remote.worker] c.a.n.c.remote.client.grpc.GrpcClient    : Server check fail, please check server 47.134.186.192 ,port 9848 is available , error ={}


java.util.concurrent.TimeoutException: Waited 3000 milliseconds (plus 4259 milliseconds, 819861 nanoseconds delay) for com.alibaba.nacos.shaded.io.grpc.stub.ClientCalls$GrpcFuture@1c3b0e0a[status=PENDING, info=[GrpcFuture{clientCall={delegate={delegate=ClientCallImpl{method=MethodDescriptor{fullMethodName=Request/request, type=UNARY, idempotent=false, safe=false, sampledToLocalTracing=true, requestMarshaller=com.alibaba.nacos.shaded.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@583c8ed6, responseMarshaller=com.alibaba.nacos.shaded.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@5faeb374, schemaDescriptor=com.alibaba.nacos.api.grpc.auto.RequestGrpc$RequestMethodDescriptorSupplier@652d09c7}}}}}]]

    ... (stack trace identical to the first occurrence above, omitted)


2022-05-11 15:31:15.664  INFO [blade-iot,1307c52cb2f7cf8f,1307c52cb2f7cf8f,false] 13611 --- [ XNIO-1 task-12] o.s.core.mp.plugins.SqlLogInterceptor    : 


==============  Sql Start  ==============

Execute SQL : select id, id_hex from factory_secondtest_data_SSD_1 where is_deleted = 0 and id_hex in ('D82D23') and id_hex != '000000' and time >= 1644564440000 and 1652254040044 >= time

Execute Time: 0.696ms

==============  Sql  End   ==============

2. 

3.


II. What result did you expect, and what did you actually see?

The runtime log is shown above. One of the services suddenly crashes and is deregistered from Nacos, then comes back online automatically after a while. With only a few machines producing data everything is fine, but once more machines are added, some service crashes. Could you help analyze the cause?

III. Which product and version are you using, and on what operating system?


IV. Please provide the detailed error stack trace; this is important.


V. If you have more details, please provide them below.

1 Answer
  • 2022-05-11 18:09

    From the error this is a connection timeout: the heartbeat check fails, so the service is taken offline. Once the next heartbeat check succeeds, the service registers again.

    java.util.concurrent.TimeoutException: Waited 3000 milliseconds
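
    For context, this exception is thrown while the Nacos client waits at most 3000 ms for the reply to its periodic gRPC server check. Below is a minimal sketch of that wait-with-timeout pattern (hypothetical code, not Nacos's actual implementation; only the 3000 ms figure is taken from the log above):

    import java.util.concurrent.*;

    public class ServerCheckSketch {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Ask the remote server whether it is healthy; the real client sends a gRPC request.
        private Future<Boolean> sendHealthRequest() {
            // Placeholder: simulate a request that never completes while the server is overloaded.
            return new CompletableFuture<>();
        }

        public void startServerCheck() {
            scheduler.scheduleWithFixedDelay(() -> {
                Future<Boolean> reply = sendHealthRequest();
                try {
                    // If no answer arrives within 3000 ms, this throws TimeoutException,
                    // which matches the "Server check fail" error in the log above.
                    reply.get(3000, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | InterruptedException | ExecutionException e) {
                    // The client then marks the connection unhealthy and reconnects;
                    // from the registry's point of view the instance goes offline.
                    System.err.println("Server check fail: " + e);
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        public static void main(String[] args) {
            new ServerCheckSketch().startServerCheck();
        }
    }

    The check itself is cheap; it only fails when the machine is too busy to answer within the window, which is why load is the thing to investigate.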

    When this happens, there are generally two likely causes, and each can be checked specifically:

    1. An endpoint takes too long to execute, exceeding the wait time, so the heartbeat fails and the service goes offline. (Fix: monitor which endpoints are slow, call them a few times to see whether the service deregisters, then optimize them — see the interceptor sketch after this list.)

    2. Memory or CPU is over-consumed, causing the service to go offline. (Fix: shut down a few of the services first, then make the calls again and see whether the service still crashes.)
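
    For the first cause, a simple way to find slow endpoints is to log every request whose handling time exceeds a threshold. A minimal Spring MVC sketch (hypothetical class name and threshold; the 3000 ms figure mirrors the heartbeat wait in the log):

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.springframework.web.servlet.HandlerInterceptor;

    public class SlowRequestInterceptor implements HandlerInterceptor {
        private static final long THRESHOLD_MS = 3000; // matches the 3000 ms heartbeat wait

        @Override
        public boolean preHandle(HttpServletRequest request, HttpServletResponse response,
                                 Object handler) {
            // Record when handling starts so afterCompletion can compute the cost.
            request.setAttribute("startTime", System.currentTimeMillis());
            return true;
        }

        @Override
        public void afterCompletion(HttpServletRequest request, HttpServletResponse response,
                                    Object handler, Exception ex) {
            long start = (Long) request.getAttribute("startTime");
            long cost = System.currentTimeMillis() - start;
            if (cost > THRESHOLD_MS) {
                // Endpoints logged here are the candidates to optimize first.
                System.err.printf("SLOW REQUEST %s took %d ms%n",
                        request.getRequestURI(), cost);
            }
        }
    }

    Register it through a WebMvcConfigurer's addInterceptors method; anything it flags is worth profiling before blaming the registry.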
