pytorch
TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY
#128739
Closed


d4l3k wants to merge 1 commit into main from d4l3k/uv_backlog
d4l3k
d4l3k337 days ago (edited 337 days ago)

This adjusts the settings of the libuv backend to match the older TCPStore.

  • DEFAULT_BACKLOG: setting this to -1 uses the host's somaxconn value instead of a hardcoded 16k value. When the backlog overflows on a host with tcp_abort_on_overflow set, new connections get reset.
  • TCP_NODELAY: since TCPStore primarily sends small messages, there's no benefit to Nagle's algorithm, and it may add extra latency to store operations.
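The two settings above can be sketched with Python's socket module. This is a hedged stand-in for the C++/libuv change, not the actual TCPStore code; the function name and host are illustrative:

```python
import socket

def make_store_listener(port: int) -> socket.socket:
    """Illustrative server socket with the two options this PR applies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # TCP_NODELAY: disable Nagle's algorithm so small store messages go out
    # immediately instead of waiting to be coalesced with later writes.
    srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    srv.bind(("127.0.0.1", port))
    # Ask for the largest backlog the host allows; the kernel clamps the
    # value passed to listen(2) to net.core.somaxconn. A too-small backlog
    # can drop (or, with tcp_abort_on_overflow=1, reset) connections when
    # many ranks connect at startup.
    srv.listen(socket.SOMAXCONN)
    return srv
```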

Test plan:

python test/distributed/test_store.py -v -k LibUv

Benchmark script:

import time
import os

import torch.distributed as dist

rank = int(os.environ["RANK"])

store = dist.TCPStore(
    host_name="<server>",
    port=29500,
    world_size=2,
    is_master=(rank == 0),
    use_libuv=True,
)

if rank == 1:
    total_iters = 0
    total_dur = 0
    for it in range(10):  # avoid shadowing the builtin iter()
        iters = 500000
        start = time.perf_counter()
        for i in range(iters):
            store.set(f"key_{i}", f"value_{i}")
        dur = time.perf_counter() - start
        print(f"{it}. {iters} set, qps = {iters/dur}")
        total_iters += iters
        total_dur += dur
    
    print(f"overall qps = {total_iters/total_dur}")
else:
    print("sleeping")
    time.sleep(1000000000)

The performance difference between TCP_NODELAY and the default appears negligible on a single host.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

pytorch-bot
pytorch-bot337 days ago (edited 337 days ago)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128739

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0975d5e with merge base 4b96575:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot pytorch-bot added oncall: distributed
pytorch-bot pytorch-bot added release notes: distributed (c10d)
d4l3k d4l3k requested a review from rsdcastro rsdcastro 337 days ago
d4l3k d4l3k requested a review from c-p-i-o c-p-i-o 337 days ago
d4l3k d4l3k requested a review from kurman kurman 337 days ago
rsdcastro
rsdcastro approved these changes on 2024-06-14
torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp
 public:
  explicit UvTcpSocket(uv_loop_t* loop) {
    uv_tcp_init(loop, &client);
    if (int err = uv_tcp_nodelay(&client, 1)) {
rsdcastro337 days ago

do we have some microbenchmark code that we could use to validate the improvement (and no unexpected perf impact)?

d4l3k337 days ago

Yes, I'll do some more testing on this internally but I expect it'll be pretty safe.

We already set TCP_NODELAY on everything that uses socket.cpp so the older TCPStoreBackend and the current TCPStore client. Not setting this seems like an oversight when writing the new backend -- it's a bit weird to set it only on the client but not the server as well since it's asymmetric.

return setSocketFlag(IPPROTO_TCP, TCP_NODELAY, true);

d4l3k337 days ago

One more data point: we do internal buffering of messages so we only call write() once the entire message is ready to be sent. Thus we only call write once per request so there won't ever be another message that can be coalesced.
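The buffering argument above can be sketched in Python. The opcode and length-prefixed layout here are illustrative, not TCPStore's real wire format; the point is that the whole request is assembled in memory first, so the transport sees exactly one write() per message and Nagle's algorithm has nothing to coalesce:

```python
import struct

def encode_set(key: bytes, value: bytes) -> bytes:
    """Assemble a complete (hypothetical) SET request in one buffer."""
    buf = bytearray()
    buf += struct.pack("<B", 1)           # opcode (made-up value for illustration)
    buf += struct.pack("<Q", len(key))    # key length, little-endian u64
    buf += key
    buf += struct.pack("<Q", len(value))  # value length, little-endian u64
    buf += value
    # The caller hands this to a single send()/uv_write() call, so there is
    # never a second small packet for Nagle to batch with.
    return bytes(buf)
```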

kurman
kurman approved these changes on 2024-06-14
kurman337 days ago

Makes sense. Would it be useful for users to have an option to set the value explicitly instead of relying on the OS?

d4l3k
d4l3k337 days ago๐Ÿ‘ 1

@kurman the OS setting is the global max, so the value passed to listen() can only shrink it further. I don't see a good use case for limiting this -- there's no case where we'd want to intentionally overflow the backlog, and there's no performance benefit to setting it lower. It might save a tiny amount of memory.

We set it to -1 on the old TCPStoreBackend, so this matches existing behavior.
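A small sketch of why -1 works here, assuming Linux's behavior of treating the listen(2) backlog as unsigned before clamping it to net.core.somaxconn (the function and default value are illustrative):

```python
def effective_backlog(requested: int, somaxconn: int = 4096) -> int:
    """Model the Linux kernel's clamping of the listen(2) backlog argument."""
    # The kernel compares the backlog as an unsigned int, so -1 wraps to a
    # huge value and is clamped down to somaxconn -- which is exactly the
    # trick this PR relies on to "use the host max".
    as_unsigned = requested & 0xFFFFFFFF
    return somaxconn if as_unsigned > somaxconn else as_unsigned
```

This also shows why a user-facing knob adds little: any explicit value can only lower the result below somaxconn, never raise it.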

facebook-github-bot
facebook-github-bot337 days ago

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

d4l3k TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY
0975d5ed
d4l3k d4l3k force pushed to 0975d5ed 337 days ago
facebook-github-bot
facebook-github-bot337 days ago

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

wconstab
wconstab commented on 2024-06-14
torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp
// This controls how many un-accepted TCP connections can be waiting in the
// backlog. This should be at least world size to avoid issues on init. We set
// it to -1 to use the host max value which is controlled by `somaxconn`.
#define DEFAULT_BACKLOG -1
wconstab337 days ago

@XilunWu do you remember why we picked 16k? iirc this was changed by @kumpera early on in the libuv backend development, maybe we just didn't imagine needing larger than this.

d4l3k337 days ago

This DEFAULT_BACKLOG setting seems to be following the libuv example -- likely they didn't realize that -1 worked to default to max and just set it to be the max known world size?

https://github.com/libuv/libuv/blob/v1.x/docs/code/tcp-echo-server/main.c#L7

c-p-i-o
c-p-i-o approved these changes on 2024-06-14
facebook-github-bot
facebook-github-bot337 days ago๐Ÿ‘ 1

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

pytorchmergebot
pytorchmergebot337 days ago

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pytorchmergebot added merging
pytorchmergebot pytorchmergebot closed this 337 days ago
pytorchmergebot pytorchmergebot added Merged
pytorchmergebot pytorchmergebot removed merging
github-actions github-actions deleted the d4l3k/uv_backlog branch 306 days ago
