pytorch
TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY
#128739
Closed


d4l3k wants to merge 1 commit into main from d4l3k/uv_backlog
d4l3k
d4l3k337 days ago (edited 337 days ago)

This adjusts the settings of the libuv backend to match the older TCPStore.

  • DEFAULT_BACKLOG: setting this to -1 uses the host's somaxconn value instead of a hardcoded 16k value. When the backlog overflows on a host with tcp_abort_on_overflow set, new connections get reset.
  • TCP_NODELAY: since TCPStore primarily sends small messages, there's no benefit to Nagle's algorithm, and it may add extra latency to store operations.
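The two settings above can be sketched with Python's socket module. This is a hedged stand-in for the C++/libuv change, not the actual TCPStore code; the function name and host are illustrative:

```python
import socket

def make_store_listener(port: int) -> socket.socket:
    """Illustrative server socket with the two options this PR applies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # TCP_NODELAY: disable Nagle's algorithm so small store messages go out
    # immediately instead of waiting to be coalesced with later writes.
    srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    srv.bind(("127.0.0.1", port))
    # Ask for the largest backlog the host allows; the kernel clamps the
    # value passed to listen(2) to net.core.somaxconn. A too-small backlog
    # can drop (or, with tcp_abort_on_overflow=1, reset) connections when
    # many ranks connect at startup.
    srv.listen(socket.SOMAXCONN)
    return srv
```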

Test plan:

python test/distributed/test_store.py -v -k LibUv

Benchmark script:

import time
import os

import torch.distributed as dist

rank = int(os.environ["RANK"])

store = dist.TCPStore(
    host_name="<server>",
    port=29500,
    world_size=2,
    is_master=(rank == 0),
    use_libuv=True,
)

if rank == 1:
    total_iters = 0
    total_dur = 0
    for it in range(10):  # avoid shadowing the builtin iter()
        iters = 500000
        start = time.perf_counter()
        for i in range(iters):
            store.set(f"key_{i}", f"value_{i}")
        dur = time.perf_counter() - start
        print(f"{it}. {iters} set, qps = {iters/dur}")
        total_iters += iters
        total_dur += dur
    
    print(f"overall qps = {total_iters/total_dur}")
else:
    print("sleeping")
    time.sleep(1000000000)

The performance difference between TCP_NODELAY and the default appears negligible on a single host.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

pytorch-bot
pytorch-bot337 days ago (edited 337 days ago)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128739

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0975d5e with merge base 4b96575:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot pytorch-bot added oncall: distributed
pytorch-bot pytorch-bot added release notes: distributed (c10d)
d4l3k d4l3k requested a review from rsdcastro rsdcastro 337 days ago
d4l3k d4l3k requested a review from c-p-i-o c-p-i-o 337 days ago
d4l3k d4l3k requested a review from kurman kurman 337 days ago
rsdcastro
rsdcastro approved these changes on 2024-06-14
torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp
 public:
  explicit UvTcpSocket(uv_loop_t* loop) {
    uv_tcp_init(loop, &client);
    if (int err = uv_tcp_nodelay(&client, 1)) {
rsdcastro337 days ago

do we have some microbenchmark code that we could use to validate the improvement (and no unexpected perf impact)?

d4l3k337 days ago

Yes, I'll do some more testing on this internally but I expect it'll be pretty safe.

We already set TCP_NODELAY on everything that uses socket.cpp so the older TCPStoreBackend and the current TCPStore client. Not setting this seems like an oversight when writing the new backend -- it's a bit weird to set it only on the client but not the server as well since it's asymmetric.

return setSocketFlag(IPPROTO_TCP, TCP_NODELAY, true);

d4l3k337 days ago

One more data point: we do internal buffering of messages so we only call write() once the entire message is ready to be sent. Thus we only call write once per request so there won't ever be another message that can be coalesced.
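The buffering argument above can be sketched in Python. The opcode and length-prefixed layout here are illustrative, not TCPStore's real wire format; the point is that the whole request is assembled in memory first, so the transport sees exactly one write() per message and Nagle's algorithm has nothing to coalesce:

```python
import struct

def encode_set(key: bytes, value: bytes) -> bytes:
    """Assemble a complete (hypothetical) SET request in one buffer."""
    buf = bytearray()
    buf += struct.pack("<B", 1)           # opcode (made-up value for illustration)
    buf += struct.pack("<Q", len(key))    # key length, little-endian u64
    buf += key
    buf += struct.pack("<Q", len(value))  # value length, little-endian u64
    buf += value
    # The caller hands this to a single send()/uv_write() call, so there is
    # never a second small packet for Nagle to batch with.
    return bytes(buf)
```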

kurman
kurman approved these changes on 2024-06-14
kurman337 days ago

Makes sense. Would it be useful for users to have an option to set the value explicitly instead of relying on the OS?

d4l3k
d4l3k337 days ago๐Ÿ‘ 1

@kurman the OS setting is the global max, so the value passed to listen() can only shrink it further. I don't see a good use case for limiting this -- there's no case where we'd want to intentionally overflow the backlog, and there's no performance benefit to setting it lower. It might save a tiny amount of memory.

We set it to -1 on the old TCPStoreBackend, so this matches existing behavior.
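A small sketch of why -1 works here, assuming Linux's behavior of treating the listen(2) backlog as unsigned before clamping it to net.core.somaxconn (the function and default value are illustrative):

```python
def effective_backlog(requested: int, somaxconn: int = 4096) -> int:
    """Model the Linux kernel's clamping of the listen(2) backlog argument."""
    # The kernel compares the backlog as an unsigned int, so -1 wraps to a
    # huge value and is clamped down to somaxconn -- which is exactly the
    # trick this PR relies on to "use the host max".
    as_unsigned = requested & 0xFFFFFFFF
    return somaxconn if as_unsigned > somaxconn else as_unsigned
```

This also shows why a user-facing knob adds little: any explicit value can only lower the result below somaxconn, never raise it.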

facebook-github-bot
facebook-github-bot337 days ago

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

d4l3k TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY
0975d5ed
d4l3k d4l3k force pushed to 0975d5ed 337 days ago
facebook-github-bot
facebook-github-bot337 days ago

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

wconstab
wconstab commented on 2024-06-14
torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp
// This controls how many un-accepted TCP connections can be waiting in the
// backlog. This should be at least world size to avoid issues on init. We set
// it to -1 to use the host max value which is controlled by `somaxconn`.
#define DEFAULT_BACKLOG -1
wconstab337 days ago

@XilunWu do you remember why we picked 16k? iirc this was changed by @kumpera early on in the libuv backend development, maybe we just didn't imagine needing larger than this.

d4l3k337 days ago

This DEFAULT_BACKLOG setting seems to be following the libuv example -- likely they didn't realize that -1 worked to default to max and just set it to be the max known world size?

https://github.com/libuv/libuv/blob/v1.x/docs/code/tcp-echo-server/main.c#L7

c-p-i-o
c-p-i-o approved these changes on 2024-06-14
facebook-github-bot
facebook-github-bot337 days ago๐Ÿ‘ 1

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

pytorchmergebot
pytorchmergebot337 days ago

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pytorchmergebot added merging
pytorchmergebot pytorchmergebot closed this 337 days ago
pytorchmergebot pytorchmergebot added Merged
pytorchmergebot pytorchmergebot removed merging
github-actions github-actions deleted the d4l3k/uv_backlog branch 306 days ago
