Create debug

#1
by Jennny - opened

(SGLangEngine pid=18255) [2026-05-02 04:00:54] [http_server] Error: The input (131134 tokens) is longer than the model's context length (131072 tokens).
(SGLangEngine pid=18255) [2026-05-02 04:00:54] INFO: 10.29.240.19:38378 - "POST /generate HTTP/1.1" 400 Bad Request
(RolloutManager pid=10873) 2026-05-01 20:00:55 WARN smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:354: request failed with client error
(RolloutManager pid=10873) [2026-05-02 04:00:55] http_utils.py:185 - Error: Client error '400 Bad Request' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400, retrying... (attempt 34/60, url=http://10.29.240.19:4362/generate, response={"error":{"message":"The input (131134 tokens) is longer than the model's context length (131072 tokens)."}})
(SGLangEngine pid=18255) [2026-05-02 04:00:55] [http_server] Error: The input (131134 tokens) is longer than the model's context length (131072 tokens).
(SGLangEngine pid=18255) [2026-05-02 04:00:55] INFO: 10.29.240.19:38378 - "POST /generate HTTP/1.1" 400 Bad Request
(RolloutManager pid=10873) 2026-05-01 20:00:56 WARN smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:354: request failed with client error
(RolloutManager pid=10873) [2026-05-02 04:00:56] http_utils.py:185 - Error: Client error '400 Bad Request' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400, retrying... (attempt 35/60, url=http://10.29.240.19:4362/generate, response={"error":{"message":"The input (131134 tokens) is longer than the model's context length (131072 tokens)."}})
(SGLangEngine pid=18255) [2026-05-02 04:00:56] [http_server] Error: The input (131134 tokens) is longer than the model's context length (131072 tokens).
(SGLangEngine pid=18255) [2026-05-02 04:00:56] INFO: 10.29.240.19:38378 - "POST /generate HTTP/1.1" 400 Bad Request
(RolloutManager pid=10873) 2026-05-01 20:00:57 WARN smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:354: request failed with client error
(RolloutManager pid=10873) [2026-05-02 04:00:57] http_utils.py:185 - Error: Client error '400 Bad Request' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400, retrying... (attempt 36/60, url=http://10.29.240.19:4362/generate, response={"error":{"message":"The input (131134 tokens) is longer than the model's context length (131072 tokens)."}})
(SGLangEngine pid=18255) [2026-05-02 04:00:57] [http_server] Error: The input (131134 tokens) is longer than the model's context length (131072 tokens).
(SGLangEngine pid=18255) [2026-05-02 04:00:57] INFO: 10.29.240.19:38378 - "POST /generate HTTP/1.1" 400 Bad Request
(RolloutManager pid=10873) 2026-05-01 20:00:59 WARN smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:354: request failed with client error
(RolloutManager pid=10873) [2026-05-02 04:00:59] http_utils.py:185 - Error: Client error '400 Bad Request' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400, retrying... (attempt 37/60, url=http://10.29.240.19:4362/generate, response={"error":{"message":"The input (131134 tokens) is longer than the model's context length (131072 tokens)."}})
(SGLangEngine pid=18255) [2026-05-02 04:00:59] [http_server] Error: The input (131134 tokens) is longer than the model's context length (131072 tokens).
(SGLangEngine pid=18255) [2026-05-02 04:00:59] INFO: 10.29.240.19:38378 - "POST /generate HTTP/1.1" 400 Bad Request
(RolloutManager pid=10873) 2026-05-01 20:01:00 WARN smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:354: request failed with client error
(RolloutManager pid=10873) [2026-05-02 04:01:00] http_utils.py:185 - Error: Client error '400 Bad Request' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400, retrying... (attempt 38/60, url=http://10.29.240.19:4362/generate, response={"error":{"message":"The input (131134 tokens) is longer than the model's context length (131072 tokens)."}})
(SGLangEngine pid=18255) [2026-05-02 04:01:00] [http_server] Error: The input (131134 tokens) is longer than the model's context length (131072 tokens).
(SGLangEngine pid=18255) [2026-05-02 04:01:00] INFO: 10.29.240.19:38378 - "POST /generate HTTP/1.1" 400 Bad Request
(RolloutManager pid=10873) 2026-05-01 20:01:01 WARN smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:354: request failed with client error
(RolloutManager pid=10873) [2026-05-02 04:01:01] http_utils.py:185 - Error: Client error '400 Bad Request' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400, retrying... (attempt 39/60, url=http://10.29.240.19:4362/generate, response={"error":{"message":"The input (131134 tokens) is longer than the model's context length (131072 tokens)."}})
(SGLangEngine pid=18255) [2026-05-02 04:01:01] [http_server] Error: The input (131134 tokens) is longer than the model's context length (131072 tokens).
(SGLangEngine pid=18255) [2026-05-02 04:01:01] INFO: 10.29.240.19:38378 - "POST /generate HTTP/1.1" 400 Bad Request
(RolloutManager pid=10873) 2026-05-01 20:01:02 WARN smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:354: request failed with client error
(RolloutManager pid=10873) [2026-05-02 04:01:02] http_utils.py:185 - Error: Client error '400 Bad Request' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400, retrying... (attempt 40/60, url=http://10.29.240.19:4362/generate, response={"error":{"message":"The input (131134 tokens) is longer than the model's context length (131072 tokens)."}})
(SGLangEngine pid=18255) [2026-05-02 04:01:02] [http_server] Error: The input (131134 tokens) is longer than the model's context length (131072 tokens).
(SGLangEngine pid=18255) [2026-05-02 04:01:02] INFO: 10.29.240.19:38378 - "POST /generate HTTP/1.1" 400 Bad Request
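
Note on the 400s above: the prompt is 131134 tokens against a 131072-token context length, so resending the same payload cannot succeed. A minimal sketch of a client-side guard that could run before the request is sent, assuming the caller already knows the tokenized prompt length. `fits_context` and `reserve_for_output` are hypothetical names, not part of slime or SGLang, and the exact boundary the server enforces is an assumption:

```python
def fits_context(num_input_tokens: int, context_len: int, reserve_for_output: int = 1) -> bool:
    """Return True if the prompt leaves room for at least `reserve_for_output` new tokens.

    Hypothetical helper: the caller would skip or truncate the sample
    instead of posting it when this returns False.
    """
    return num_input_tokens + reserve_for_output <= context_len


# Values taken from the log above: 131134 input tokens, 131072 context length.
print(fits_context(131134, 131072))  # False
```

A sample that fails this check could be marked as truncated or aborted on the rollout side rather than sent to `/generate`.
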
(RolloutManager pid=10873) 2026-05-01 20:01:04 ERROR smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:349: request failed with server error
(RolloutManager pid=10873) [2026-05-02 04:01:04] http_utils.py:185 - Error: Server error '503 Service Unavailable' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503, retrying... (attempt 41/60, url=http://10.29.240.19:4362/generate, response={"error":{"type":"Service Unavailable","code":"no_available_workers","message":"No available workers (all circuits open or unhealthy)"}})
(RolloutManager pid=10873) 2026-05-01 20:01:05 ERROR smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:349: request failed with server error
(RolloutManager pid=10873) [2026-05-02 04:01:05] http_utils.py:185 - Error: Server error '503 Service Unavailable' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503, retrying... (attempt 42/60, url=http://10.29.240.19:4362/generate, response={"error":{"type":"Service Unavailable","code":"no_available_workers","message":"No available workers (all circuits open or unhealthy)"}})
(RolloutManager pid=10873) 2026-05-01 20:01:06 ERROR smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:349: request failed with server error
(RolloutManager pid=10873) [2026-05-02 04:01:06] http_utils.py:185 - Error: Server error '503 Service Unavailable' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503, retrying... (attempt 43/60, url=http://10.29.240.19:4362/generate, response={"error":{"type":"Service Unavailable","code":"no_available_workers","message":"No available workers (all circuits open or unhealthy)"}})
(RolloutManager pid=10873) 2026-05-01 20:01:08 ERROR smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:349: request failed with server error
(RolloutManager pid=10873) [2026-05-02 04:01:08] http_utils.py:185 - Error: Server error '503 Service Unavailable' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503, retrying... (attempt 44/60, url=http://10.29.240.19:4362/generate, response={"error":{"type":"Service Unavailable","code":"no_available_workers","message":"No available workers (all circuits open or unhealthy)"}})
(RolloutManager pid=10873) 2026-05-01 20:01:09 ERROR smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:349: request failed with server error
(RolloutManager pid=10873) [2026-05-02 04:01:09] http_utils.py:185 - Error: Server error '503 Service Unavailable' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503, retrying... (attempt 45/60, url=http://10.29.240.19:4362/generate, response={"error":{"type":"Service Unavailable","code":"no_available_workers","message":"No available workers (all circuits open or unhealthy)"}})
(RolloutManager pid=10873) 2026-05-01 20:01:11 ERROR smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:349: request failed with server error
(RolloutManager pid=10873) [2026-05-02 04:01:11] http_utils.py:185 - Error: Server error '503 Service Unavailable' for url 'http://10.29.240.19:4362/generate'

(RolloutManager pid=10873) 2026-05-01 20:01:31 ERROR smg::response: /home/runner/work/sglang/sglang/src/middleware.rs:349: request failed with server error
(RolloutManager pid=10873) [2026-05-02 04:01:31] http_utils.py:185 - Error: Server error '503 Service Unavailable' for url 'http://10.29.240.19:4362/generate'
(RolloutManager pid=10873) For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503, retrying... (attempt 60/60, url=http://10.29.240.19:4362/generate, response={"error":{"type":"Service Unavailable","code":"no_available_workers","message":"No available workers (all circuits open or unhealthy)"}})
(RolloutManager pid=10873) [2026-05-02 04:01:31] http_utils.py:189 - Max retries (60) reached, failing... (url=http://10.29.240.19:4362/generate)
(RolloutManager pid=10873) [2026-05-02 04:01:31] acot_rollout.py:93 - [acot_rollout] active=1 inflight=0 reqs=240593 ctrl=118711 reas=121882 done=2815
Traceback (most recent call last):
File "/mnt/data/yiran/AgenticCoT/slime/train.py", line 172, in <module>
train(args)
File "/mnt/data/yiran/AgenticCoT/slime/train.py", line 139, in train
rollout_data_ref = ray.get(rollout_manager.generate.remote(rollout_id))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 2980, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 1025, in get_objects
raise value
ray.exceptions.UnserializableException: Failed to deserialize exception. Refer to https://docs.ray.io/en/latest/ray-core/objects/serialization.html#custom-serializers-for-exceptions for more information.
Original exception:
ray.exceptions.RayTaskError: ray::RolloutManager.generate() (pid=10873, ip=10.29.240.19, actor_id=d12ded2f3d9b91f0a35981ac02000000, repr=<slime.ray.rollout.RolloutManager object at 0x7fc957e0be00>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/ray/rollout.py", line 484, in generate
data, metrics = self._get_rollout_data(rollout_id=rollout_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/ray/rollout.py", line 583, in _get_rollout_data
data = call_rollout_fn(self.generate_rollout, self.args, rollout_id, self.data_source, evaluation=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/base_types.py", line 20, in call_rollout_fn
output = fn(*args, **kwargs, evaluation=evaluation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/sglang_rollout.py", line 587, in generate_rollout
output, aborted_samples = run(generate_rollout_async(args, rollout_id, data_source.get_samples))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/utils/async_utils.py", line 36, in run
return get_async_loop().run(coro)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/utils/async_utils.py", line 20, in run
return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/slime/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/mnt/data/miniconda3/envs/slime/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/sglang_rollout.py", line 398, in generate_rollout_async
group: list[Sample] = task.result()
^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/sglang_rollout.py", line 301, in generate_and_rm_group
group = await asyncio.gather(*tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/sglang_rollout.py", line 246, in generate_and_rm
sample = await custom_generate_func(args, sample, sampling_params, evaluation=evaluation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/acot_rollout.py", line 453, in generate
return await _generate_impl(args, sample, sampling_params, evaluation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/acot_rollout.py", line 766, in _generate_impl
mc_results = await asyncio.gather(
*(_one_mc(k) for k in range(mc_per_sample)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/acot_rollout.py", line 687, in _one_mc
reasoner_text, finish_type = await _call_reasoner(
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/acot_rollout.py", line 367, in _call_reasoner
output = await _tracked_post(url, payload, kind="reasoner", headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/rollout/acot_rollout.py", line 131, in _tracked_post
return await post(url, payload, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/utils/http_utils.py", line 290, in post
return await _post(_http_client, url, payload, max_retries, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/yiran/AgenticCoT/slime/slime/utils/http_utils.py", line 190, in _post
raise e
File "/mnt/data/yiran/AgenticCoT/slime/slime/utils/http_utils.py", line 171, in _post
response.raise_for_status()
File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/httpx/_models.py", line 829, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '503 Service Unavailable' for url 'http://10.29.240.19:4362/generate'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503
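
In the run above, the same 400 was still being retried at attempt 40/60 before the router started returning 503 `no_available_workers`. Whether those retries are what opened the circuits is a guess from the log, not something the log confirms. A minimal sketch of a retry loop that stops immediately on non-retryable client errors, assuming a hypothetical `send` callable that returns an HTTP status code. This is not the actual `http_utils._post`, and backoff between attempts is left out for brevity:

```python
from typing import Callable


def is_retryable(status: int) -> bool:
    """Treat 408/429 and all 5xx as transient; every other 4xx is permanent."""
    return status in (408, 429) or 500 <= status < 600


def post_with_retries(send: Callable[[], int], max_retries: int = 60) -> tuple[int, int]:
    """Call `send` until it succeeds, hits a non-retryable status, or runs out of attempts.

    Returns (last_status, attempts_used). `send` is a hypothetical stand-in
    for the real HTTP call.
    """
    status = 0
    for attempt in range(1, max_retries + 1):
        status = send()
        if status < 400 or not is_retryable(status):
            return status, attempt
    return status, max_retries


# A 400 such as "input is longer than the model's context length" fails on the first attempt.
print(post_with_retries(lambda: 400))
```

With this classification a 400 surfaces to the caller after one attempt, while a 503 still uses the full retry budget.
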

[dlc8j74eijhgn7lq-master-0] Rollout generation: 94%|█████████▍| 120/128 [1:15:02<01:03, 7.88s/it][2026-04-27 02:27:00] acot_rollout.py:93 - [acot_rollout] active=1 inflight=0 reqs=556320 ctrl=273127 reas=283193 done=1279
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=18268) [2026-04-27 02:27:00] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 7036, token usage: 0.01, #running-req: 0, #queue-req: 0, input throughput (token/s): 14.75, cuda graph: False [repeated 2x across cluster]
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=18268) [2026-04-27 02:27:01] Decode batch, #running-req: 1, #token: 7072, token usage: 0.01, cuda graph: True, gen throughput (token/s): 27.31, #queue-req: 0 [repeated 2x across cluster]
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=18268) [2026-04-27 02:27:07] Decode batch, #running-req: 1, #token: 7232, token usage: 0.01, cuda graph: True, gen throughput (token/s): 29.10, #queue-req: 0 [repeated 4x across cluster]
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=18268) [2026-04-27 02:27:09] INFO: 10.29.240.32:38800 - "POST /generate HTTP/1.1" 200 OK [repeated 2x across cluster]
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671)
[dlc8j74eijhgn7lq-master-0] Rollout generation: 100%|██████████| 128/128 [1:15:15<00:00, 5.99s/it]
[dlc8j74eijhgn7lq-master-0] Rollout generation: 100%|██████████| 128/128 [1:15:15<00:00, 35.28s/it]
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:13] sglang_rollout.py:423 - Finish rollout: ["<|im_start|>system\nYou are a reasoning control agent that guides a step-by-step mathematical reasoner.\n\nEach turn you receive the reasoner's latest step and the remaining token budget (as a percentage). On the first turn you receive the question instead. Considering the reasoner's current progress and planning on the remaining budget, choose an action and a starter phrase for the reasoner's next step. Your goal is to guide the reasoner to make progress towards the final correct answer while staying within the budget.\n\nActions:\n- understand: Restate or clarify the problem.\n- plan: Decide strategy or split into subproblems.\n- execute: Carry out algebraic or logical steps.\n- explore: Try an alternative approach.\n- check: Verify steps or correct mistakes.\n- summarize: Summarize current results.\n- conclude: State the final answer.\n\nOutput format:\nAction: \nStarter: <|im_end|>\n<|im_start|>user\nQuestion:\nDetermine the smallest prime $p$ such that $2018!$ is divisible by $p^{3}$ , but not divisible by $p^{4}$.\n\nBudget Remaining: 100%<|im_end|>\n<|im_start|>assistant\nAction: understand\nStarter: Okay, so I have this problem:<|im_end|>\n<|im_start|>user\nReasoner's Step:\nOkay, so I have this problem: I need to find the smallest prime number p such that 2018! is divisible by p cubed (p³), but not by p to the fourth power (p⁴). Hmm, let's think about how to approach this.\n\nBudget Remaining: 99%<|im_end|>\n<|im_start|>assistant\nAction: plan\nStarter: First, I remember that when we're<|im_end|>\n<|im_start|>user\nReasoner's Step:\nFirst, I remember that when we're dealing with factorials and divisibility by prime powers, we can use Legendre's formula. Legendre's formula helps us find the exponent of a prime p in the prime factorization of n! (n factorial). 
The formula is the sum of the floor of n divided by p^k for k from 1 to infinity. Basically, it's adding up how many multiples of p, p², p³, etc., are in the numbers from 1 to n.\n\nBudget Remaining: 97%<|im_end|>\n<|im_start|>assistant\nAction: check\nStarter: So, for each<|im_end|>\n<|im_start|>user\nReasoner's Step:\nSo, for each prime p, we need to calculate the exponent of p in 2018! and check if that exponent is exactly 3. Wait, no. The problem says that 2018! must be divisible by p³ but not by p⁴. That means the exponent of p in 2018! must be exactly 3? Wait, no. Actually, if it's divisible by p³, the exponent is at least 3, and if it's not divisible by p⁴, then the exponent is less than 4. So, the exponent of p in 2018! should be 3. That makes sense. So, we need primes p where the exponent of p in 2018! is exactly 3. And we need the smallest such prime.\n\nBudget Remaining: 94%<|im_end|>\n<|im_start|>assistant\nAction: check\nStarter: But wait,<|im_end|>\n<|im_start|>user\nReasoner's Step:\nBut wait, primes smaller than p will have higher exponents in the factorial. Because primes divide numbers more frequently. So, the smaller the prime, the higher the exponent in the factorial. Therefore, primes smaller than a certain value will have exponents greater than 3, and primes larger than that will have exponents less than 3. Wait, but the problem is about exponents equal to 3? Wait, no. Let's clarify.\n\nBudget Remaining: 93%<|im_end|>\n<|im_start|>assistant\nAction: understand\nStarter: If the<|im_end|>\n<|im_start|>user\nReasoner's Step:\nIf the exponent of p in 2018! is exactly 3, then 2018! is divisible by p³ but not p⁴. So, primes where the exponent is exactly 3. 
So, we need primes p where when we compute the sum of floor(2018/p) + floor(2018/p²) + floor(2018/p³) + ..., the total is 3.\n\nBudget Remaining: 91%<|im_end|>\n<|im_start|>assistant\nAction: summarize\nStarter: Therefore, the<|im_end|>\n<|im_start|>user\nReasoner's Step:\nTherefore, the task reduces to finding the smallest prime p such that this sum equals 3. Since primes get larger, and the exponents of primes...
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:13] sglang_rollout.py:326 - Abort request for ['http://10.29.240.32:15006', 'http://10.29.240.32:15000', 'http://10.29.240.32:15002', 'http://10.29.240.32:15004']
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:13] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 1/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=18268) [2026-04-27 02:27:08] Decode batch, #running-req: 1, #token: 7272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 29.12, #queue-req: 0
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11357) [2026-04-27 02:27:13] INFO: 10.29.240.32:33492 - "POST /abort_request HTTP/1.1" 200 OK
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:14] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 2/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:15] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 3/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:16] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 4/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:17] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 5/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:18] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 6/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:19] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 7/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:20] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 8/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:21] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 9/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:22] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 10/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:23] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 11/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:24] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 12/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:25] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 13/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:26] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 14/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:27] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 15/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:28] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 16/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:29] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 17/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:30] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 18/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:31] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 19/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:32] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 20/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:33] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 21/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:34] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 22/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:35] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 23/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:36] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 24/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:37] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 25/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:38] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 26/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:39] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 27/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:40] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 28/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:41] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 29/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:42] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 30/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:43] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 31/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:44] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 32/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:45] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 33/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:46] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 34/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:47] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 35/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:48] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 36/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:49] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 37/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:50] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 38/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:51] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 39/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:52] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 40/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:53] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 41/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:54] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 42/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:55] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 43/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:56] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 44/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:57] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 45/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:58] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 46/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:27:59] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 47/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:00] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 48/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:01] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 49/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:02] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 50/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:03] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 51/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:04] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 52/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:05] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 53/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:06] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 54/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:07] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 55/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:08] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 56/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:09] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 57/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:10] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 58/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:11] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 59/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:12] http_utils.py:185 - Error: All connection attempts failed, retrying... (attempt 60/60, url=http://10.29.240.32:15000/abort_request, response=None)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:12] http_utils.py:189 - Max retries (60) reached, failing... (url=http://10.29.240.32:15000/abort_request)
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:12] sglang_rollout.py:331 - Failed to abort worker at http://10.29.240.32:15000: All connection attempts failed
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:12] rollout.py:605 - Final collected 128 samples from rollout to train
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:12] acot_log.py:374 - controller metrics 9: {'controller/correctness': 0.580078125, 'controller/num_steps': 77.3046875, 'controller/budget_used_pct': 94.3965059837185, 'controller/budget_remained_pct': 8.162939412797705, 'rollout/step': 9, 'controller/rewards': 0.482421875, 'controller/reward_std_within_group': 0.23536898350658472, 'controller/zero_var_group_ratio': 0.0625, 'controller/starter_diversity_within_group': 3.375, 'controller/within_budget_ratio': 0.8046875, 'controller/over_budget_ratio': 0.1953125}
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:12] acot_log.py:140 - [acot_log] Appended 8 sampled controller rows to /mnt/data/yiran/AgenticCoT/slime/slime/rollout/../../../logs/controller_sample_groups.jsonl
[dlc8j74eijhgn7lq-master-0] (RolloutManager pid=10671) [2026-04-27 02:28:12] rollout.py:1194 - perf 9: {'rollout/response_len/mean': 11.6953125, 'rollout/response_len/median': 11.0, 'rollout/response_len/max': 16, 'rollout/response_len/min': 9, 'rollout/zero_std/count_0.0': 1, 'rollout/repetition_frac': 0.0, 'rollout/truncated_ratio': 0.0, 'perf/rollout_time': 4574.8515474796295, 'perf/tokens_per_gpu_per_sec': 0.04090296658983189, 'perf/longest_sample_tokens_per_sec': 0.003497381244821965, 'perf/effective_tokens_per_gpu_per_sec': 0.04090296658983189, 'perf/longest_effective_sample_tokens_per_sec': 0.003497381244821965}
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11359) [2026-04-27 02:28:12] Cache flushed successfully!
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11357) [2026-04-27 02:28:12] INFO: 10.29.240.32:39588 - "GET /flush_cache HTTP/1.1" 200 OK
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11359) [2026-04-27 02:27:13] INFO: 10.29.240.32:40208 - "POST /abort_request HTTP/1.1" 200 OK [repeated 2x across cluster]
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:12] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11357) [2026-04-27 02:28:12] INFO: 10.29.240.32:39596 - "POST /release_memory_occupation HTTP/1.1" 200 OK
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:13] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:14] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:15] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:16] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:17] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11358) [2026-04-27 02:28:12] Cache flushed successfully! [repeated 5x across cluster]
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:18] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:19] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:20] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:21] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:22] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:23] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:24] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:25] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:26] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:27] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:28] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:29] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:30] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:31] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:32] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:33] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:34] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:35] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:36] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:37] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:38] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:39] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:40] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:41] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:42] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:43] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:44] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:45] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:46] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:47] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:48] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:49] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:50] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:51] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:52] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:53] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:54] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:55] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:56] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:57] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:58] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:28:59] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:00] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:01] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:02] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:03] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:04] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:05] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:06] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:07] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:08] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:09] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:10] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11356) [2026-04-27 02:29:11] INFO sglang_engine.py:299: Error flushing cache: HTTPConnectionPool(host='10.29.240.32', port=15000): Max retries exceeded with url: /flush_cache (Caused by NewConnectionError("HTTPConnection(host='10.29.240.32', port=15000): Failed to establish a new connection: [Errno 111] Connection refused"))
[dlc8j74eijhgn7lq-master-0] Traceback (most recent call last):
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/yiran/AgenticCoT/slime/train.py", line 110, in <module>
[dlc8j74eijhgn7lq-master-0] train(args)
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/yiran/AgenticCoT/slime/train.py", line 80, in train
[dlc8j74eijhgn7lq-master-0] ray.get(rollout_manager.offload.remote())
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
[dlc8j74eijhgn7lq-master-0] return fn(*args, **kwargs)
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
[dlc8j74eijhgn7lq-master-0] return func(*args, **kwargs)
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 2980, in get
[dlc8j74eijhgn7lq-master-0] values, debugger_breakpoint = worker.get_objects(
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/miniconda3/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 1023, in get_objects
[dlc8j74eijhgn7lq-master-0] raise value.as_instanceof_cause()
[dlc8j74eijhgn7lq-master-0] ray.exceptions.RayTaskError(TimeoutError): ray::RolloutManager.offload() (pid=10671, ip=10.29.240.32, actor_id=882477207e422dcc952b718702000000, repr=<slime.ray.rollout.RolloutManager object at 0x7f3ad4128140>)
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/yiran/AgenticCoT/slime/slime/ray/rollout.py", line 513, in offload
[dlc8j74eijhgn7lq-master-0] srv.offload()
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/yiran/AgenticCoT/slime/slime/ray/rollout.py", line 317, in offload
[dlc8j74eijhgn7lq-master-0] return ray.get(handles) if handles else []
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] ray.exceptions.RayTaskError(TimeoutError): ray::SGLangEngine.release_memory_occupation() (pid=11356, ip=10.29.240.32, actor_id=8405bf24d1b7445c5eba11c302000000, repr=<slime.backends.sglang_utils.sglang_engine.SGLangEngine object at 0x7f021ebfcaa0>)
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/yiran/AgenticCoT/slime/slime/backends/sglang_utils/sglang_engine.py", line 353, in release_memory_occupation
[dlc8j74eijhgn7lq-master-0] self.flush_cache()
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/yiran/AgenticCoT/slime/slime/backends/sglang_utils/sglang_engine.py", line 303, in flush_cache
[dlc8j74eijhgn7lq-master-0] raise TimeoutError("Timeout while flushing cache.")
[dlc8j74eijhgn7lq-master-0] TimeoutError: Timeout while flushing cache.
[dlc8j74eijhgn7lq-master-0] wandb:
[dlc8j74eijhgn7lq-master-0] wandb: 🚀 View run slime-controller-rl_uzu46baa-RANK_0 at: https://wandb.ai/jenny-shen/AgenticCoT/runs/9w619fu1
[dlc8j74eijhgn7lq-master-0] wandb: Find logs at: wandb/run-20260426_184720-9w619fu1/logs
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11359) [2026-04-27 02:28:12] INFO: 10.29.240.32:60998 - "GET /flush_cache HTTP/1.1" 200 OK [repeated 2x across cluster]
[dlc8j74eijhgn7lq-master-0] (SGLangEngine pid=11359) [2026-04-27 02:28:12] INFO: 10.29.240.32:32778 - "POST /release_memory_occupation HTTP/1.1" 200 OK [repeated 2x across cluster]
[dlc8j74eijhgn7lq-master-0] 2026-04-27 02:29:32,649 ERR cli.py:73 -- ---------------------------------------
[dlc8j74eijhgn7lq-master-0] 2026-04-27 02:29:32,649 ERR cli.py:74 -- Job 'raysubmit_pKpnEqQY7ieAmqG4' failed
[dlc8j74eijhgn7lq-master-0] 2026-04-27 02:29:32,649 ERR cli.py:75 -- ---------------------------------------
[dlc8j74eijhgn7lq-master-0] 2026-04-27 02:29:32,649 INFO cli.py:88 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
[dlc8j74eijhgn7lq-master-0] self.flush_cache()
[dlc8j74eijhgn7lq-master-0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[dlc8j74eijhgn7lq-master-0] File "/mnt/data/yiran/AgenticCoT/slime/slime/backends/sglang_utils/sglang_engine.py", line 303, in flush_cache
[dlc8j74eijhgn7lq-master-0] raise TimeoutError("Timeout while flushing cache.")
[dlc8j74eijhgn7lq-master-0] TimeoutError: Timeout while flushing cache.
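
The log above shows roughly one failed /flush_cache attempt per second, each with "Connection refused", ending in `TimeoutError: Timeout while flushing cache.` A minimal sketch of that retry-until-exhausted pattern follows. It is not slime's actual `flush_cache` code: `wait_until_ok`, `probe`, and the default attempt count and delay are hypothetical names and values chosen for illustration.

```python
import time


def wait_until_ok(probe, attempts=60, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out.

    probe is a hypothetical zero-argument callable: it returns True on
    success and may raise ConnectionError (e.g. "Connection refused").
    Returns the 1-based attempt number that succeeded.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return attempt
        except ConnectionError as exc:
            # Keep the last failure so the final TimeoutError can chain it.
            last_error = exc
            print(f"Error flushing cache: {exc} (attempt {attempt}/{attempts})")
        sleep(delay_s)
    # Chaining preserves the underlying cause in the traceback.
    raise TimeoutError("Timeout while flushing cache.") from last_error
```

Chaining the last connection error onto the timeout keeps the root cause visible: here every attempt was refused, which suggests the server process on port 15000 was no longer listening, so more retries alone would not have helped. If the probe uses `requests`, note that `requests.exceptions.ConnectionError` is not a subclass of the built-in `ConnectionError`, so the `except` clause would need to name it explicitly.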

python - <<'PY'
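import triton
# Report which Triton build is on the import path.
print("triton:", getattr(triton, "__version__", "unknown"))
print("triton file:", triton.__file__)

# Importing the backend module checks that SGLang's Triton attention path loads.
from sglang.srt.layers.attention import triton_backend
print("sglang triton backend import: OK")
print("backend file:", triton_backend.__file__)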
PY

[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18721) [2026-04-28 09:22:55] Decode batch, #running-req: 1, #token: 0, token usage: 0.00, cuda graph: True, gen throughput (token/s): 33.48, #queue-req: 0 [repeated 15x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11793) [2026-04-28 09:22:55] Prefill batch, #new-seq: 1, #new-token: 38, #cached-token: 12117, token usage: 0.17, #running-req: 5, #queue-req: 0, input throughput (token/s): 173.15, cuda graph: False [repeated 116x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11792) [2026-04-28 09:22:56] INFO: 10.29.240.42:56064 - "POST /generate HTTP/1.1" 200 OK [repeated 155x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11794) [2026-04-28 09:23:00] Decode batch, #running-req: 3, #token: 44593, token usage: 0.10, cuda graph: True, gen throughput (token/s): 36.66, #queue-req: 0 [repeated 21x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11795) [2026-04-28 09:23:00] Prefill batch, #new-seq: 1, #new-token: 56, #cached-token: 12282, token usage: 0.09, #running-req: 2, #queue-req: 0, input throughput (token/s): 4402.72, cuda graph: False [repeated 144x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:01] INFO: 10.29.240.42:47380 - "POST /generate HTTP/1.1" 200 OK [repeated 147x across cluster]
[dlczq3cay47mtefu-master-0] (RolloutManager pid=11105) [2026-04-28 09:23:04] acot_rollout.py:93 - [acot_rollout] active=18 inflight=31 reqs=168990 ctrl=83014 reas=85976 done=366
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11794) [2026-04-28 09:23:05] Decode batch, #running-req: 1, #token: 14079, token usage: 0.03, cuda graph: True, gen throughput (token/s): 23.74, #queue-req: 0 [repeated 19x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11795) [2026-04-28 09:23:05] Prefill batch, #new-seq: 1, #new-token: 34, #cached-token: 14495, token usage: 0.07, #running-req: 1, #queue-req: 0, input throughput (token/s): 61.16, cuda graph: False [repeated 136x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:06] INFO: 10.29.240.42:47384 - "POST /generate HTTP/1.1" 200 OK [repeated 149x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:10] Decode batch, #running-req: 12, #token: 87327, token usage: 0.09, cuda graph: True, gen throughput (token/s): 248.22, #queue-req: 0 [repeated 21x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:11] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 7847, token usage: 0.11, #running-req: 12, #queue-req: 0, input throughput (token/s): 20.89, cuda graph: False [repeated 119x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:11] INFO: 10.29.240.42:47720 - "POST /generate HTTP/1.1" 200 OK [repeated 142x across cluster]
[dlczq3cay47mtefu-master-0] (RolloutManager pid=11105) [2026-04-28 09:23:14] acot_rollout.py:93 - [acot_rollout] active=18 inflight=30 reqs=169256 ctrl=83144 reas=86112 done=366
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:16] Decode batch, #running-req: 7, #token: 39238, token usage: 0.04, cuda graph: True, gen throughput (token/s): 330.77, #queue-req: 0 [repeated 20x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18722) [2026-04-28 09:23:15] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 7112, token usage: 0.05, #running-req: 6, #queue-req: 0, input throughput (token/s): 33.14, cuda graph: False [repeated 101x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18722) [2026-04-28 09:23:16] INFO: 10.29.240.42:36666 - "POST /generate HTTP/1.1" 200 OK [repeated 106x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18722) [2026-04-28 09:23:21] INFO: 10.29.240.42:42394 - "POST /generate HTTP/1.1" 200 OK [repeated 8x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18722) [2026-04-28 09:23:21] Decode batch, #running-req: 1, #token: 8586, token usage: 0.01, cuda graph: True, gen throughput (token/s): 53.95, #queue-req: 0 [repeated 10x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11793) [2026-04-28 09:23:21] Prefill batch, #new-seq: 1, #new-token: 52, #cached-token: 12823, token usage: 0.03, #running-req: 0, #queue-req: 0, input throughput (token/s): 7.71, cuda graph: False
[dlczq3cay47mtefu-master-0] (RolloutManager pid=11105) [2026-04-28 09:23:25] acot_rollout.py:93 - [acot_rollout] active=15 inflight=25 reqs=169302 ctrl=83171 reas=86131 done=369
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:27] Decode batch, #running-req: 1, #token: 7711, token usage: 0.01, cuda graph: True, gen throughput (token/s): 33.73, #queue-req: 0 [repeated 6x across cluster]
[dlczq3cay47mtefu-master-0] (RolloutManager pid=11105)
[dlczq3cay47mtefu-master-0] Rollout generation: 75%|███████▌  | 96/128 [23:05<06:36, 12.40s/it]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18722) [2026-04-28 09:23:29] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 8054, token usage: 0.01, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.08, cuda graph: False
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18720) [2026-04-28 09:23:29] INFO: 10.29.240.42:45660 - "POST /generate HTTP/1.1" 200 OK [repeated 4x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11795) [2026-04-28 09:23:32] Decode batch, #running-req: 2, #token: 26469, token usage: 0.06, cuda graph: True, gen throughput (token/s): 64.14, #queue-req: 0 [repeated 13x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:34] Prefill batch, #new-seq: 1, #new-token: 4, #cached-token: 6863, token usage: 0.08, #running-req: 10, #queue-req: 0, input throughput (token/s): 19.21, cuda graph: False [repeated 102x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11792) [2026-04-28 09:23:34] INFO: 10.29.240.42:49204 - "POST /generate HTTP/1.1" 200 OK [repeated 109x across cluster]
[dlczq3cay47mtefu-master-0] (RolloutManager pid=11105) [2026-04-28 09:23:35] acot_rollout.py:93 - [acot_rollout] active=15 inflight=22 reqs=169441 ctrl=83231 reas=86210 done=369
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18722) [2026-04-28 09:23:37] Decode batch, #running-req: 3, #token: 23573, token usage: 0.02, cuda graph: True, gen throughput (token/s): 87.20, #queue-req: 0 [repeated 21x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11792) [2026-04-28 09:23:39] Prefill batch, #new-seq: 1, #new-token: 26, #cached-token: 16064, token usage: 0.04, #running-req: 0, #queue-req: 0, input throughput (token/s): 47.34, cuda graph: False [repeated 104x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11793) [2026-04-28 09:23:39] INFO: 10.29.240.42:46792 - "POST /generate HTTP/1.1" 200 OK [repeated 112x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=18719) [2026-04-28 09:23:42] Decode batch, #running-req: 10, #token: 64645, token usage: 0.07, cuda graph: True, gen throughput (token/s): 185.35, #queue-req: 0 [repeated 20x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11792) [2026-04-28 09:23:44] Prefill batch, #new-seq: 1, #new-token: 51, #cached-token: 13549, token usage: 0.03, #running-req: 0, #queue-req: 0, input throughput (token/s): 64.94, cuda graph: False [repeated 98x across cluster]
[dlczq3cay47mtefu-master-0] (SGLangEngine pid=11792) [2026-04-28 09:23:44] INFO: 10.29.240.42:49220 - "POST /generate HTTP/1.1" 200 OK [repeated 108x across cluster]
[dlczq3cay47mtefu-master-0] (RolloutManager pid=11105) [2026-04-28 09:23:45] acot_rollout.py:93 - [acot_rollout] active=15 inflight=21 reqs=169658 ctrl=83338 reas=86320 done=369

