accelerate
torch-native pipeline parallelism for big models
#2345
Merged


muellerzr merged 39 commits into main from pippy-integration-v2
muellerzr
muellerzr1 year ago (edited 1 year ago)🚀 1

Example use:

import torch
from accelerate.inference import prepare_pippy
from accelerate.utils import set_seed
from transformers import T5ForConditionalGeneration, T5Config

set_seed(42)

config = T5Config()
model = T5ForConditionalGeneration(config)
model.eval()

# Create example inputs for the model
input = torch.randint(
    low=0,
    high=config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
    dtype=torch.int64,
    requires_grad=False,
)

example_inputs = {"input_ids": input, "decoder_input_ids": input}

model = prepare_pippy(model, example_kwargs=example_inputs)

args = (
    example_inputs["input_ids"].to("cuda:0"),
    example_inputs["decoder_input_ids"].to("cuda:0")
)
with torch.no_grad():
    output = model(*args)
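Since each pipeline stage lives on its own process, the script is meant to be launched with one process per GPU (e.g. via `accelerate launch`), and, as the docs added in this PR note, the final output lands on the last process. A minimal sketch of how it could be pulled out there:

from accelerate import PartialState

# Only the last pipeline stage holds the final output, so inspect it there
if PartialState().is_last_process:
    print(output)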

Speed up:

Using 2x 4090s in full precision

Bert
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.2137s                 0.3119s
Average of 5 batches   0.0099s                 0.0062s

GPT2
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.1959s                 0.4189s
Average of 5 batches   0.0205s                 0.0126s

T5
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.2789s                 0.3809s
Average of 5 batches   0.0198s                 0.0166s
muellerzr Broken version
e713e28e
muellerzr Timing I would expect
2767bb19
muellerzr Working version!
06f04a99
muellerzr marked this pull request as draft 1 year ago

SunMarc
SunMarc commented on 2024-01-16
SunMarc1 year ago

Very cool API! I like the design and how easy it is to use. I left a few comments, mainly around the split_points.

src/accelerate/inference.py
    def forward(*args, **kwargs):
        return model_forward(*args, **kwargs)

    # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
    forward.__wrapped__ = model_forward
SunMarc1 year ago

Nice!

src/accelerate/inference.py
    """
    Calculates the device map for `model` with an offset for PiPPy
    """
    no_split_module_classes = getattr(model, "_no_split_modules", [])
    if num_processes == 1:
        return infer_auto_device_map(model, no_split_module_classes=no_split_module_classes, clean_result=False)
    model_size, shared = calculate_maximum_sizes(model)

    # Split into `n` chunks for each GPU
    memory = (model_size + shared[0]) / num_processes
    memory = convert_bytes(memory)
    value, ending = memory.split(" ")

    # Add a chunk to deal with potential extra shared memory instances
    memory = math.ceil(float(value)) * 1.1
    memory = f"{memory} {ending}"
    device_map = infer_auto_device_map(
        model,
        max_memory={i: memory for i in range(num_processes)},
        no_split_module_classes=no_split_module_classes,
        clean_result=False,
    )
SunMarc1 year ago👍 1

We can definitely generate a balanced device_map exclusively for pippy (device_map="balanced_pippy") if the current balanced option is not the best fit for it. However, I think it would be great if the user could also use other options like "sequential". I didn't try it, but what happens when we only fill 2 GPUs out of the 4 available (a possible sequential case)?

src/accelerate/inference.py
    """
    Calculates the device map for `model` with an offset for PiPPy
    """
    no_split_module_classes = getattr(model, "_no_split_modules", [])
SunMarc1 year ago👍 1

Let's add no_split_module_classes as an optional arg like in big model inference. The _no_split_modules attribute is only used in transformers and I don't think we should infer it in accelerate. Moreover, we don't use it directly, since we need to concat the _no_split_modules from the submodules through the _get_no_split_modules method.
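For illustration, a sketch of what the explicit argument could look like once it exists (the class name is an assumption for a T5-style model, not something this PR pins down):

# Hypothetical call once `no_split_module_classes` is an optional argument
device_map = generate_device_map(model, num_processes=2, no_split_module_classes=["T5Block"])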

src/accelerate/inference.py
    """
    example_args = send_to_device(example_args, "cpu")
    example_kwargs = send_to_device(example_kwargs, "cpu")
    if device_map == "auto":
        device_map = generate_device_map(model, PartialState().num_processes)
    stage = build_pipeline(model, device_map, example_args, example_kwargs)
SunMarc1 year ago👍 1

Just a thought about how to handle the split points.

    1. We only expose device_map with predefined options ("sequential", "balanced_pippy").
    2. We let the user pass a custom device_map. The custom case can be complicated, since the user needs to be careful about the order (OrderedDict()) and has to assign the GPUs sequentially because of split_points.append(next(k for k, v in device_map.items() if v == i)).
    3. We let the user pass their own split points (List[str]).

I think that 1) is a must. Between 2) and 3), I prefer 3) since it is easier for the user.
muellerzr1 year ago

Agreed to do 1 and 3
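To make option 3 concrete, a sketch of the user-facing call (the module names are invented for illustration):

# Option 3: the user supplies their own split points as a list of module names
model = prepare_pippy(
    model,
    split_points=["encoder.block.6", "decoder.block.0"],
    example_kwargs=example_inputs,
)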

src/accelerate/inference.py
    return device_map


def build_pipeline(model, device_map, args, kwargs) -> PipelineStage:
    """
    Attaches the split points to the model based on `self.device_map` and generates a `PipelineStage`. Requires passing
    in needed `args` and `kwargs` as the model needs on the CPU.
    """
    # We need to annotate the split points in the model for PiPPy
    state = PartialState()
    split_points = []
    for i in range(1, state.num_processes):
        split_points.append(next(k for k, v in device_map.items() if v == i))
SunMarc1 year ago👍 1

Let's pass split_points to build_pipeline instead of device_map. WDYT? It would be better if we decide to let the user pass split points.

muellerzr1 year ago

Not sure if I like this yet, it seems clunky in implementation but I'll keep at it.
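For context, a toy illustration of what the derivation in the snippet above computes: the first module assigned to each new device becomes a split point (module names invented for the example):

# Toy device_map: module name -> device index, in insertion order
device_map = {"embed": 0, "block.0": 0, "block.1": 1, "block.2": 1, "lm_head": 1}
num_processes = 2
split_points = [next(k for k, v in device_map.items() if v == i) for i in range(1, num_processes)]
# split_points == ["block.1"]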

src/accelerate/inference.py
    example_args = send_to_device(example_args, "cpu")
    example_kwargs = send_to_device(example_kwargs, "cpu")
    if device_map == "auto":
        device_map = generate_device_map(model, PartialState().num_processes)
SunMarc1 year ago👍 1

For debug purposes, I'm thinking that we could save the split_points as model.hf_split_points. I'm not sure about saving the device_map as model.hf_device_map, since if we let the user pass the split_points we might not have a device_map at all. However, I think we should be able to recreate it from the split_points.

muellerzr1 year ago

Agreed, we can just keep them as hf_split_points.
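As a quick illustration of the debugging value, assuming the attribute is stored as discussed:

# Inspect where the model was cut, e.g. when debugging an unbalanced pipeline
print(model.hf_split_points)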

SunMarc
SunMarc commented on 2024-01-16
src/accelerate/inference.py
    # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
    forward.__wrapped__ = model_forward
    model.forward = forward

    return stage
SunMarc1 year ago👀 1

You are returning the stage. I think it should be the model, no?

muellerzr Use MethodType
9eef9dd4
muellerzr force-pushed from 66fb6116 to 9eef9dd4 1 year ago
muellerzr working test
449eb8d9
muellerzr Tests
77f8e92b
muellerzr Use no split module classes explicitly
df7779aa
muellerzr Put split_points in pipelien
e3f6b99b
muellerzr Store split points in hf_split_points
8792a8c5
SunMarc fix case num_process=1
7ca4bccf
muellerzr
muellerzr commented on 2024-01-18
src/accelerate/inference.py
def pippy_forward(forward, *args, **kwargs):
    state = PartialState()
    output = None
-   if state.is_local_main_process:
+   if state.num_processes == 1:
        output = forward(*args, **kwargs)
muellerzr1 year ago

Nice!

SunMarc
SunMarc approved these changes on 2024-01-24
SunMarc1 year ago (edited 1 year ago)

The API is in good shape! Let's document the main functions a bit and we can merge it. I left a few comments but nothing blocking.

src/accelerate/inference.py
)


ParallelMode = Literal["sequential", "pipeline_parallel"]
SunMarc1 year ago

To remove if not used

src/accelerate/inference.py
ParallelMode = Literal["sequential", "pipeline_parallel"]


def generate_device_map(model, num_processes: int = 1, no_split_module_classes=None):
    """
    Calculates the device map for `model` with an offset for PiPPy
    """
    if num_processes == 1:
        return infer_auto_device_map(model, no_split_module_classes=no_split_module_classes, clean_result=False)
    model_size, shared = calculate_maximum_sizes(model)

    # Split into `n` chunks for each GPU
    memory = (model_size + shared[0]) / num_processes
    memory = convert_bytes(memory)
    value, ending = memory.split(" ")

    # Add a chunk to deal with potential extra shared memory instances
    memory = math.ceil(float(value)) * 1.1
    memory = f"{memory} {ending}"
    device_map = infer_auto_device_map(
        model,
        max_memory={i: memory for i in range(num_processes)},
        no_split_module_classes=no_split_module_classes,
        clean_result=False,
    )
    return device_map
SunMarc1 year ago

Let's add a max_memory arg so that the user can set it themselves if they want. Then, if max_memory isn't provided and num_processes != 1, we would compute its value with the allocation above, and call infer_auto_device_map once at the end of the function.
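A sketch of how the proposed argument might be used, assuming it accepts the same per-device mapping as infer_auto_device_map's max_memory (the values here are illustrative):

# Hypothetical once the arg exists: cap each GPU manually instead of using the automatic balanced split
device_map = generate_device_map(
    model,
    num_processes=2,
    max_memory={0: "18GiB", 1: "18GiB"},
)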

src/accelerate/inference.py
    state = PartialState()
    example_args = send_to_device(example_args, "cpu")
    example_kwargs = send_to_device(example_kwargs, "cpu")
    if split_points == "auto":
        device_map = generate_device_map(model, state.num_processes, no_split_module_classes=no_split_module_classes)
        split_points = []
        for i in range(1, state.num_processes):
            split_points.append(next(k for k, v in device_map.items() if v == i))
SunMarc1 year ago

It would be great to have a sanity check to make sure that we indeed have num_processes split points, both when we generate the split_points and when the user passes them manually.
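A minimal sketch of such a check, assuming the convention of one split point per additional process (i.e. num_processes - 1 in total); `PartialState` and `split_points` are the names from the snippet above:

expected = PartialState().num_processes - 1
if len(split_points) != expected:
    raise ValueError(f"Expected {expected} split points but got {len(split_points)}")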

src/accelerate/inference.py
        return pippy_forward(stage.forward, *args, **kwargs)

    # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
    model_forward = MethodType(forward, model)
    forward.__wrapped__ = model_forward
SunMarc1 year ago👍 1

We can keep them for now but these lines create an infinite recursion loop with generate

muellerzr1 year ago

Made a note about this

src/accelerate/inference.py
    Wraps `model` for PipelineParallelism
    """
    state = PartialState()
    example_args = send_to_device(example_args, "cpu")
    example_kwargs = send_to_device(example_kwargs, "cpu")
SunMarc1 year ago👍 1

The API from pippy may change, but we should keep in mind that dealing with kwargs is a mess in pippy. Passed kwargs are expected, but that only applies to tensor args; the other args are "burned".

muellerzr1 year ago

Made a note in the docstring warning users. We can do so more thoroughly in the real documentation.

SunMarc
SunMarc commented on 2024-01-24
src/accelerate/inference.py
    # We need to annotate the split points in the model for PiPPy
    state = PartialState()
    annotate_split_points(model, {split_point: PipeSplitWrapper.SplitPoint.BEGINNING for split_point in split_points})
    pipe = Pipe.from_tracing(model, num_chunks=state.num_processes, example_args=args, example_kwargs=kwargs)
SunMarc1 year ago (edited 1 year ago)👍 1

We are hardcoding num_chunks to state.num_processes. This is definitely a good default since, as Ke noted, we would need at least num_processes chunks for the pipeline to have good overlap. However, we should maybe let the user change this value and test it themselves. I think this is a really important arg and the user should try to understand its effect since it impacts the performance of pippy.

muellerzr1 year ago

Documented
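Once the argument is exposed (it appears in the prepare_pippy signature later in this PR), tuning it could look like the following; the value 4 is just an example:

# Example only: cut each incoming batch into 4 microbatches instead of one per process
model = prepare_pippy(model, example_kwargs=example_inputs, num_chunks=4)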

muellerzr Allow for dynamic batch padding (#2352)
dac1daa9
muellerzr Rm literal
364c3b61
muellerzr Allow users to pass in max_memory
6a8479b2
muellerzr Note about recursion
303c9cc7
muellerzr Document, document, document
d497e8af
muellerzr Right import check
06bbc5b5
muellerzr Merge branch 'main' into pippy-integration-v2
5e047da8
muellerzr Fix bug, add tests to multigpu runners
a5059e62
muellerzr Change default to None
71346a19
kwen2501
kwen2501 approved these changes on 2024-01-31
kwen25011 year ago

Thanks a lot for the integration effort!
LGTM!

SunMarc
SunMarc approved these changes on 2024-01-31
SunMarc1 year ago

Thx for iterating! LGTM

src/accelerate/inference.py
def prepare_pippy(
    model, split_points="auto", no_split_module_classes=None, example_args=(), example_kwargs={}, num_chunks=None
):
SunMarc1 year ago👍 2

Learned from @fxmarty that it is dangerous to use a mutable dictionary as a default value: https://stackoverflow.com/questions/26320899/why-is-the-empty-dictionary-a-dangerous-default-value-in-python
Let's set it to example_kwargs: Optional[Dict[str, Any]] = None, just like in the from_tracing docstring.
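For reference, a sketch of the None-default idiom the suggestion points to (not the final signature):

def prepare_pippy(model, split_points="auto", example_args=(), example_kwargs=None, num_chunks=None):
    # A new dict is created per call, so callers can't accidentally share or mutate a default
    if example_kwargs is None:
        example_kwargs = {}
    ...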

src/accelerate/inference.py
    if found_batch_size is None:
        raise ValueError("Could not find batch size from args or kwargs")
    else:
        if found_batch_size != state.num_processes:
            args = pad_input_tensors(args, found_batch_size, state.num_processes)
            kwargs = pad_input_tensors(kwargs, found_batch_size, state.num_processes)
SunMarc1 year ago👍 1

I think we should replace state.num_processes with num_chunks.

tests/test_utils.py
    batch_size = 61
    batch = torch.rand(batch_size, 4, 4)
    result = pad_input_tensors(batch, batch_size, num_processes)
    # We should expect there to be 6 items now
SunMarc1 year ago
Suggested change
# We should expect there to be 6 items now
# We should expect there to be 66 items now
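For the record, the arithmetic behind the suggested "66" (this assumes num_processes is 6 in this test, which the excerpt does not show):

import math

num_processes = 6   # assumed, not visible in the excerpt above
batch_size = 61
padded = math.ceil(batch_size / num_processes) * num_processes
# padded == 66: the batch is padded up to the next multiple of num_processes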
src/accelerate/inference.py
            The expected inputs for the model that uses dictionary-based inputs. This is a *highly* limiting structure
            that requires the same keys be present at *all* inference calls. Not recommended unless the prior condition
            is true for all cases.
        num_chunks (`int`):
            The number of different stages the Pipeline will have. By default it will assign one chunk per GPU, but
            this can be tuned and played with. In general one should have num_chunks > num_gpus.
SunMarc1 year ago

Hi @kwen2501, could you help us better understand what this arg does? Is it the same arg as mentioned in this doc? Say we set num_chunks to 4: will it split the data of batch size n into 4 chunks of batch size n/4? And if we set num_chunks=1, we get the naive MP case. Is that right?

kwen25011 year ago❤ 2

That's right. Same as the definition in the doc you linked, i.e. how many microbatches we cut the input batch into.

Note that num_chunks is not necessarily the number of pipeline stages. For example, you can still have num_chunks=1 but 4 pipeline stages. That 1 chunk would pass through the 4 stages in series. But since there are no more data chunks, a stage (GPU) would become idle once the chunk passes through it.

kwen25011 year ago (edited 1 year ago)

That said, I agree with the comment that says:

In general one should have num_chunks > num_gpus

(maybe adding a "=" sign too)

SunMarc1 year ago

Thanks for clarifying, @kwen2501!
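To make the microbatching arithmetic described above concrete, a standalone illustration (not library code):

import torch

batch = torch.randn(8, 512)                            # full input batch
num_chunks = 4
microbatches = torch.chunk(batch, num_chunks, dim=0)   # 4 microbatches of shape (2, 512)
# Each microbatch flows through the stages in sequence, so an earlier stage can start
# on microbatch i+1 while a later stage is still working on microbatch i.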

muellerzr Start of docs
fe66b937
muellerzr Try again?
d2af472b
muellerzr Try again x2
8dc6c6c8
muellerzr Trailing comma
4d0aeb2b
muellerzr Move import
309b71a6
muellerzr Clean
9f561f11
muellerzr typehint
d5a6fda3
muellerzr typo
954a668e
kwen2501
kwen2501 approved these changes on 2024-02-05
kwen25011 year ago❤ 1

Thanks for writing the doc so quickly! Looks good to me!

docs/source/usage_guides/distributed_inference.md
- ## The Problem
+ 1. Loading an entire model onto each GPU and sending chunks of a batch through each model at a time
+ 2. Load parts of a model onto each GPU and process a single input at one time
+ 3. Load parts of a model onto each GPU and used what is called scheduled Pipeline Parallelism to combine the two prior techniques.
kwen25011 year ago

used --> use

docs/source/usage_guides/distributed_inference.md
This next part will discuss using *pipeline parallelism*. This is an **experimental** API utilizing the [PiPPy library by PyTorch](https://github.com/pytorch/PiPPy/) as a native solution.

The general idea with pipeline parallism is say you have 4 GPUs, and a model big enough it can be *split* on four GPUs using `device_map="auto"`. What this version will do is you can send in 4 inputs at at time (for example here, any amount works) and each models chunk will work on an ainput, then recieve the next input after the prior chunk finished it making it *much* more efficient **and faster** than the prior version. Here's a visual taken from the PyTorch repository:
kwen25011 year ago

each models chunk --> each model chunk
an ainput --> an input

docs/source/usage_guides/distributed_inference.md
From here all that's left is to actually perform the distributed inference!

<Tip warning={true}>
    When passing in inputs, while using `kwargs` are supported currently those are even *more* experimental, so it's highly recommended to just simply pass inptus in as a tuple of arguments.
kwen25011 year ago

inptus --> inputs

muellerzr From code review
853f5521
muellerzr Use num_chunks
1362e5c2
muellerzr Update tests/test_utils.py
68bd89b3
muellerzr marked this pull request as ready for review 1 year ago
kwen2501
kwen2501 commented on 2024-02-05
src/accelerate/inference.py
        raise ValueError("Could not find batch size from args or kwargs")
    else:
-       if found_batch_size != state.num_processes:
+       if found_batch_size != num_chunks:
            args = pad_input_tensors(args, found_batch_size, state.num_processes)
            kwargs = pad_input_tensors(kwargs, found_batch_size, state.num_processes)
kwen25011 year ago👍 1

Do we want to pad with num_chunks or state.num_processes?

muellerzr1 year ago

Indeed should be num_chunks all the way through, bad copy/paste caught me :)

muellerzr Bad copy/paste
181fbda9
muellerzr changed the title from "Pippy integration v2" to "torch-native pipeline parallelism for big models" 1 year ago
muellerzr requested a review from MKhalusova 1 year ago
muellerzr
muellerzr1 year ago

cc @MKhalusova for the docs!

muellerzr hf_split_points
9157cf16
MKhalusova
MKhalusova commented on 2024-02-06
MKhalusova1 year ago❤ 2

Nice work! I left a few comments to polish things in the docs a bit.

docs/source/package_reference/inference.md
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
MKhalusova1 year ago

If this is a new doc, you can put 2024 in the copyright

docs/source/usage_guides/distributed_inference.md
- Distributed inference is a common use case, especially with natural language processing (NLP) models. Users often want to
- send a number of different prompts, each to a different GPU, and then get the results back. This also has other cases
- outside of just NLP, however for this tutorial we will focus on just this idea of each GPU receiving a different prompt,
- and then returning the results.
+ Distributed inference can fall into three brackets:
- ## The Problem
+ 1. Loading an entire model onto each GPU and sending chunks of a batch through each model at a time
MKhalusova1 year ago👍 1

Suggestion: "Loading an entire model onto each GPU and sending chunks of a batch through each GPU's model copy at a time"

I think it's a bit clearer this way.

docs/source/usage_guides/distributed_inference.md
+ Distributed inference can fall into three brackets:
- ## The Problem
+ 1. Loading an entire model onto each GPU and sending chunks of a batch through each model at a time
+ 2. Load parts of a model onto each GPU and process a single input at one time
MKhalusova1 year ago
Suggested change
2. Load parts of a model onto each GPU and process a single input at one time
2. Loading parts of a model onto each GPU and processing a single input at one time
MKhalusova1 year ago

For consistency with the first item in the list, let's say "loading" instead of "load".

docs/source/usage_guides/distributed_inference.md
- ## The Problem
+ 1. Loading an entire model onto each GPU and sending chunks of a batch through each model at a time
+ 2. Load parts of a model onto each GPU and process a single input at one time
+ 3. Load parts of a model onto each GPU and use what is called scheduled Pipeline Parallelism to combine the two prior techniques.
MKhalusova1 year ago
Suggested change
3. Load parts of a model onto each GPU and use what is called scheduled Pipeline Parallelism to combine the two prior techniques.
3. Loading parts of a model onto each GPU and using what is called scheduled Pipeline Parallelism to combine the two prior techniques.
docs/source/usage_guides/distributed_inference.md
2. Load parts of a model onto each GPU and process a single input at one time
3. Load parts of a model onto each GPU and use what is called scheduled Pipeline Parallelism to combine the two prior techniques.

We're going to go through the first and the last, showcasing how to do each as they are more realistic scenarios.
MKhalusova1 year ago
Suggested change
We're going to go through the first and the last, showcasing how to do each as they are more realistic scenarios.
We're going to go through the first and the last bracket, showcasing how to do each as they are more realistic scenarios.
docs/source/usage_guides/distributed_inference.md
We're going to go through the first and the last, showcasing how to do each as they are more realistic scenarios.

## Sending chunks of inputs automatically to each loaded model
MKhalusova1 year ago
Suggested change
## Sending chunks of inputs automatically to each loaded model
## Sending chunks of a batch automatically to each loaded model
MKhalusova1 year ago

To be consistent with the first item in the list

docs/source/usage_guides/distributed_inference.md
## Sending chunks of inputs automatically to each loaded model

This is the most memory-intensive solution, as it requires each GPU keeps a full copy of the model in memory at a given time.
MKhalusova1 year ago
Suggested change
This is the most memory-intensive solution, as it requires each GPU keeps a full copy of the model in memory at a given time.
This is the most memory-intensive solution, as it requires each GPU to keep a full copy of the model in memory at a given time.
docs/source/usage_guides/distributed_inference.md
On the first GPU, the prompts will be `["a dog", "a cat"]`, and on the second GPU it will be `["a chicken", "a chicken"]`.
Make sure to drop the final sample, as it will be a duplicate of the previous one.

## A more memory-efficient version (experimental)
MKhalusova1 year ago

This is a vague title. Let's rename to something a bit more clear.
E.g. "Memory-efficient pipeline parallelism (experimental)"

docs/source/usage_guides/distributed_inference.md
This next part will discuss using *pipeline parallelism*. This is an **experimental** API utilizing the [PiPPy library by PyTorch](https://github.com/pytorch/PiPPy/) as a native solution.

The general idea with pipeline parallism is say you have 4 GPUs, and a model big enough it can be *split* on four GPUs using `device_map="auto"`. What this version will do is you can send in 4 inputs at at time (for example here, any amount works) and each model chunk will work on an input, then recieve the next input after the prior chunk finished it making it *much* more efficient **and faster** than the prior version. Here's a visual taken from the PyTorch repository:
MKhalusova1 year ago
Suggested change
The general idea with pipeline parallism is say you have 4 GPUs, and a model big enough it can be *split* on four GPUs using `device_map="auto"`. What this version will do is you can send in 4 inputs at at time (for example here, any amount works) and each model chunk will work on an input, then recieve the next input after the prior chunk finished it making it *much* more efficient **and faster** than the prior version. Here's a visual taken from the PyTorch repository:
The general idea with pipeline parallelism is: say you have 4 GPUs and a model big enough it can be *split* on four GPUs using `device_map="auto"`. With this method you can send in 4 inputs at a time (for example here, any amount works) and each model chunk will work on an input, then receive the next input once the prior chunk finished, making it *much* more efficient **and faster** than the method described earlier. Here's a visual taken from the PyTorch repository:
docs/source/usage_guides/distributed_inference.md
![PiPPy example](https://camo.githubusercontent.com/681d7f415d6142face9dd1b837bdb2e340e5e01a58c3a4b119dea6c0d99e2ce0/68747470733a2f2f692e696d6775722e636f6d2f657955633934372e706e67)

To use this with Accelerate, we have created a [model zoo](https://github.com/muellerzr/pippy-device-map-playground/) showcasing a number of different models and situations to do so. In this tutorial we'll take GPT2 however across two gpus.
MKhalusova1 year ago
Suggested change
To use this with Accelerate, we have created a [model zoo](https://github.com/muellerzr/pippy-device-map-playground/) showcasing a number of different models and situations to do so. In this tutorial we'll take GPT2 however across two gpus.
To illustrate how you can use this with Accelerate, we have created a [model zoo example](https://github.com/muellerzr/pippy-device-map-playground/) showcasing a number of different models and situations. In this tutorial, we'll show this method for GPT2 across two GPUs.
docs/source/usage_guides/distributed_inference.md
To use this with Accelerate, we have created a [model zoo](https://github.com/muellerzr/pippy-device-map-playground/) showcasing a number of different models and situations to do so. In this tutorial we'll take GPT2 however across two gpus.

Before anything, please make sure you have the latest pippy installed by performing:
MKhalusova1 year ago
Suggested change
Before anything, please make sure you have the latest pippy installed by performing:
Before you proceed, please make sure you have the latest pippy installed by running the following:
docs/source/usage_guides/distributed_inference.md
pip install torchpippy
```

We require at least version 0.2.0, please perform `pip show torchpippy` to check this!
MKhalusova1 year ago
Suggested change
We require at least version 0.2.0, please perform `pip show torchpippy` to check this!
We require at least version 0.2.0. To confirm that you have the correct version, run `pip show torchpippy`.
docs/source/usage_guides/distributed_inference.md
We require at least version 0.2.0, please perform `pip show torchpippy` to check this!

First we need to create the model on the CPU:
MKhalusova1 year ago
Suggested change
First we need to create the model on the CPU:
Start by creating the model on the CPU:
docs/source/usage_guides/distributed_inference.md
model.eval()
```

Next we need to create some example inputs to use. These help PiPPy trace the model.
MKhalusova1 year ago
Suggested change
Next we need to create some example inputs to use. These help PiPPy trace the model.
Next you'll need to create some example inputs to use. These help PiPPy trace the model.
docs/source/usage_guides/distributed_inference.md
<Tip warning={true}>
    However you make this example will determine the relative batch size that will be used/passed
    through the model at a given time, so make sure to remember them!
MKhalusova1 year ago

I'm not sure I understand what "them" refers to in "make sure to remember them"

docs/source/usage_guides/distributed_inference.md
    requires_grad=False,
)
```

Next we need to actually perform the tracing and get the model ready. To do so you simply use the [`inference.prepare_pippy`] function and it will fully wrap the model for pipeline parallism automatically:
MKhalusova1 year ago
Suggested change
Next we need to actually perform the tracing and get the model ready. To do so you simply use the [`inference.prepare_pippy`] function and it will fully wrap the model for pipeline parallism automatically:
Next we need to actually perform the tracing and get the model ready. To do so, use the [`inference.prepare_pippy`] function and it will fully wrap the model for pipeline parallelism automatically:
docs/source/usage_guides/distributed_inference.md
<Tip>
    There are a variety of parameters you can pass through to `prepare_pippy`:

* `split_points` will let you determine where to split the model at. By default we use wherever `device_map="auto" declares
MKhalusova1 year ago
Suggested change
* `split_points` will let you determine where to split the model at. By default we use wherever `device_map="auto" declares
* `split_points` lets you determine where to split the model. By default, we use wherever `device_map="auto" declares
MKhalusova1 year ago

Declares what? Maybe give an example?

muellerzr1 year ago

Done!

docs/source/usage_guides/distributed_inference.md
model = prepare_pippy(model, example_args=(input,))
```

<Tip>
MKhalusova1 year ago
Suggested change
<Tip>
<Tip>
MKhalusova1 year ago

Added a newline, so that the content inside the tip would render properly

docs/source/usage_guides/distributed_inference.md
<Tip>
    There are a variety of parameters you can pass through to `prepare_pippy`:
    * `split_points` will let you determine where to split the model at. By default we use wherever `device_map="auto" declares
* `num_chunks` can be used to determine how the batch will be split and sent to the model itself (so `num_chunks=1` with four split points/four GPUs would have a naive MP where a single input gets passed between the four layer split points)
MKhalusova1 year ago
Suggested change
* `num_chunks` can be used to determine how the batch will be split and sent to the model itself (so `num_chunks=1` with four split points/four GPUs would have a naive MP where a single input gets passed between the four layer split points)
* `num_chunks` determines how the batch will be split and sent to the model itself (so `num_chunks=1` with four split points/four GPUs will have a naive MP where a single input gets passed between the four layer split points)
docs/source/usage_guides/distributed_inference.md
* `num_chunks` can be used to determine how the batch will be split and sent to the model itself (so `num_chunks=1` with four split points/four GPUs would have a naive MP where a single input gets passed between the four layer split points)
</Tip>

From here all that's left is to actually perform the distributed inference!
MKhalusova1 year ago
Suggested change
From here all that's left is to actually perform the distributed inference!
From here, all that's left is to actually perform the distributed inference!
docs/source/usage_guides/distributed_inference.md
From here all that's left is to actually perform the distributed inference!

<Tip warning={true}>
MKhalusova1 year ago
Suggested change
<Tip warning={true}>
<Tip warning={true}>
docs/source/usage_guides/distributed_inference.md
From here all that's left is to actually perform the distributed inference!

<Tip warning={true}>
    When passing in inputs, while using `kwargs` are supported currently those are even *more* experimental, so it's highly recommended to just simply pass inputs in as a tuple of arguments.
MKhalusova1 year ago
Suggested change
When passing in inputs, while using `kwargs` are supported currently those are even *more* experimental, so it's highly recommended to just simply pass inputs in as a tuple of arguments.
When passing inputs, we highly recommend to pass them in as a tuple of arguments. Using `kwargs` is supported, however, this approach is experimental.
docs/source/usage_guides/distributed_inference.md
    output = model(*args)
```

Afterwards all the data will be on the last GPU, which you can use the [`PartialState`] to find and extract:
MKhalusova1 year ago
Suggested change
Afterwards all the data will be on the last GPU, which you can use the [`PartialState`] to find and extract:
When finished, all the data will be on the last GPU, which you can use the [`PartialState`] to find and extract:
muellerzr Apply suggestions from code review
f2c6e088
muellerzr Year
9f204969
muellerzr Nit
e1961d6d
muellerzr better title
8c72a5e6
muellerzr Rephrase
3eaa9678
muellerzr Rephrase
31fcde4d
muellerzr
muellerzr1 year ago👍 1

Final comment before merging, things that still need to be done in a later PR at some point (but it's okay for them not to be in the first iteration of this joint effort):

  1. Specify balanced_pippy device map and allow a sequential device_map when making the pipeline via prepare_pippy
  2. Look into supporting model.generate() through an alternative hook into the model forward if possible
  3. Make sure all outputs end up on the CPU so users don't need to check at the end and we can call them via a .gather
  4. Migrate the pippy-device-map-playground examples over to here as part of our examples folder

(I'll be doing 3 & 4 this week as a follow-up prior to release)

muellerzr Try spacing maybe?
7c3d1830
muellerzr merged 0867c093 into main 1 year ago
muellerzr deleted the pippy-integration-v2 branch 1 year ago
