accelerate
torch-native pipeline parallelism for big models
#2345
Merged


muellerzr merged 39 commits into main from pippy-integration-v2
muellerzr
muellerzr1 year ago (edited 1 year ago)🚀 1

Example use:

import torch
from accelerate.inference import prepare_pippy
from accelerate.utils import set_seed
from transformers import T5ForConditionalGeneration, T5Config

set_seed(42)

config = T5Config()
model = T5ForConditionalGeneration(config)
model.eval()

# Create example inputs for the model
input = torch.randint(
    low=0,
    high=config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
    dtype=torch.int64,
    requires_grad=False,
)

example_inputs = {"input_ids": input, "decoder_input_ids": input}

model = prepare_pippy(model, example_kwargs=example_inputs)

args = (
    example_inputs["input_ids"].to("cuda:0"),
    example_inputs["decoder_input_ids"].to("cuda:0")
)
with torch.no_grad():
    output = model(*args)
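Since each pipeline stage lives on its own process, the script is meant to be launched with one process per GPU (e.g. via `accelerate launch`), and, as the docs added in this PR note, the final output lands on the last process. A minimal sketch of how it could be pulled out there:

from accelerate import PartialState

# Only the last pipeline stage holds the final output, so inspect it there
if PartialState().is_last_process:
    print(output)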

Speed up:

Using 2x 4090s in full precision

Bert
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.2137s                 0.3119s
Average of 5 batches   0.0099s                 0.0062s

GPT2
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.1959s                 0.4189s
Average of 5 batches   0.0205s                 0.0126s

T5
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.2789s                 0.3809s
Average of 5 batches   0.0198s                 0.0166s
muellerzr Broken version
e713e28e
muellerzr Timing I would expect
2767bb19
muellerzr Working version!
06f04a99
muellerzr marked this pull request as draft 1 year ago

SunMarc
SunMarc commented on 2024-01-16
SunMarc1 year ago

Very cool API! I like the design and how easy it is to use. I left a few comments, mainly around the split_points.

src/accelerate/inference.py
    def forward(*args, **kwargs):
        return model_forward(*args, **kwargs)

    # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
    forward.__wrapped__ = model_forward
SunMarc1 year ago

Nice!

src/accelerate/inference.py
    """
    Calculates the device map for `model` with an offset for PiPPy
    """
    no_split_module_classes = getattr(model, "_no_split_modules", [])
    if num_processes == 1:
        return infer_auto_device_map(model, no_split_module_classes=no_split_module_classes, clean_result=False)
    model_size, shared = calculate_maximum_sizes(model)

    # Split into `n` chunks for each GPU
    memory = (model_size + shared[0]) / num_processes
    memory = convert_bytes(memory)
    value, ending = memory.split(" ")

    # Add a chunk to deal with potential extra shared memory instances
    memory = math.ceil(float(value)) * 1.1
    memory = f"{memory} {ending}"
    device_map = infer_auto_device_map(
        model,
        max_memory={i: memory for i in range(num_processes)},
        no_split_module_classes=no_split_module_classes,
        clean_result=False,
    )
SunMarc1 year ago👍 1

We can definitely generate a balanced device_map exclusively for pippy (device_map="balanced_pippy") if the current balanced option is not the best fit for it. However, I think it would be great if the user could also use other options like "sequential". I didn't try it, but what happens when we only fill 2 GPUs out of the 4 available (a possible sequential case)?

src/accelerate/inference.py
    """
    Calculates the device map for `model` with an offset for PiPPy
    """
    no_split_module_classes = getattr(model, "_no_split_modules", [])
SunMarc1 year ago👍 1

Let's add no_split_module_classes as an optional arg like in big model inference. The _no_split_modules attribute is only used in transformers and I don't think we should infer it in accelerate. Moreover, we don't use it directly, since we need to concat the _no_split_modules from the submodules through the _get_no_split_modules method.
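For illustration, a sketch of what the explicit argument could look like once it exists (the class name is an assumption for a T5-style model, not something this PR pins down):

# Hypothetical call once `no_split_module_classes` is an optional argument
device_map = generate_device_map(model, num_processes=2, no_split_module_classes=["T5Block"])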

src/accelerate/inference.py
    """
    example_args = send_to_device(example_args, "cpu")
    example_kwargs = send_to_device(example_kwargs, "cpu")
    if device_map == "auto":
        device_map = generate_device_map(model, PartialState().num_processes)
    stage = build_pipeline(model, device_map, example_args, example_kwargs)
SunMarc1 year ago👍 1

Just a thought about how to handle the split points.

    1. We only expose device_map with predefined options ("sequential", "balanced_pippy").
    2. We let the user pass a custom device_map. The custom case can be complicated, since the user needs to be careful about the order (OrderedDict()) and has to assign the GPUs sequentially because of split_points.append(next(k for k, v in device_map.items() if v == i)).
    3. We let the user pass their own split points (List[str]).

I think that 1) is a must. Between 2) and 3), I prefer 3) since it is easier for the user.
muellerzr1 year ago

Agreed to do 1 and 3
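To make option 3 concrete, a sketch of the user-facing call (the module names are invented for illustration):

# Option 3: the user supplies their own split points as a list of module names
model = prepare_pippy(
    model,
    split_points=["encoder.block.6", "decoder.block.0"],
    example_kwargs=example_inputs,
)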

src/accelerate/inference.py
    return device_map


def build_pipeline(model, device_map, args, kwargs) -> PipelineStage:
    """
    Attaches the split points to the model based on `self.device_map` and generates a `PipelineStage`. Requires passing
    in needed `args` and `kwargs` as the model needs on the CPU.
    """
    # We need to annotate the split points in the model for PiPPy
    state = PartialState()
    split_points = []
    for i in range(1, state.num_processes):
        split_points.append(next(k for k, v in device_map.items() if v == i))
SunMarc1 year ago👍 1

Let's pass split_points to build_pipeline instead of device_map. WDYT? It would be better if we decide to let the user pass split points.

muellerzr1 year ago

Not sure if I like this yet, it seems clunky in implementation but I'll keep at it.
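For context, a toy illustration of what the derivation in the snippet above computes: the first module assigned to each new device becomes a split point (module names invented for the example):

# Toy device_map: module name -> device index, in insertion order
device_map = {"embed": 0, "block.0": 0, "block.1": 1, "block.2": 1, "lm_head": 1}
num_processes = 2
split_points = [next(k for k, v in device_map.items() if v == i) for i in range(1, num_processes)]
# split_points == ["block.1"]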

src/accelerate/inference.py
    example_args = send_to_device(example_args, "cpu")
    example_kwargs = send_to_device(example_kwargs, "cpu")
    if device_map == "auto":
        device_map = generate_device_map(model, PartialState().num_processes)
SunMarc1 year ago👍 1

For debug purposes, I'm thinking that we could save the split_points as model.hf_split_points. I'm not sure about saving the device_map as model.hf_device_map, since if we let the user pass the split_points we might not have a device_map at all. However, I think we should be able to recreate it from the split_points.

muellerzr1 year ago

Agreed, we can just keep them as hf_split_points.
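As a quick illustration of the debugging value, assuming the attribute is stored as discussed:

# Inspect where the model was cut, e.g. when debugging an unbalanced pipeline
print(model.hf_split_points)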

SunMarc
SunMarc commented on 2024-01-16
src/accelerate/inference.py
    # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
    forward.__wrapped__ = model_forward
    model.forward = forward

    return stage
SunMarc1 year ago👀 1

You are returning the stage. I think it should be the model, no?

muellerzr Use MethodType
9eef9dd4
muellerzr force-pushed from 66fb6116 to 9eef9dd4 1 year ago
muellerzr working test
449eb8d9
muellerzr Tests
77f8e92b
muellerzr Use no split module classes explicitly
df7779aa
muellerzr Put split_points in pipelien
e3f6b99b
muellerzr Store split points in hf_split_points
8792a8c5
SunMarc fix case num_process=1
7ca4bccf
muellerzr
muellerzr commented on 2024-01-18
src/accelerate/inference.py
def pippy_forward(forward, *args, **kwargs):
    state = PartialState()
    output = None
-   if state.is_local_main_process:
+   if state.num_processes == 1:
        output = forward(*args, **kwargs)
muellerzr1 year ago

Nice!

SunMarc
SunMarc approved these changes on 2024-01-24
SunMarc1 year ago (edited 1 year ago)

The API is in good shape! Let's document the main functions a bit and we can merge it. I left a few comments but nothing blocking.

src/accelerate/inference.py
)


ParallelMode = Literal["sequential", "pipeline_parallel"]
SunMarc1 year ago

To remove if not used

src/accelerate/inference.py
ParallelMode = Literal["sequential", "pipeline_parallel"]


def generate_device_map(model, num_processes: int = 1, no_split_module_classes=None):
    """
    Calculates the device map for `model` with an offset for PiPPy
    """
    if num_processes == 1:
        return infer_auto_device_map(model, no_split_module_classes=no_split_module_classes, clean_result=False)
    model_size, shared = calculate_maximum_sizes(model)

    # Split into `n` chunks for each GPU
    memory = (model_size + shared[0]) / num_processes
    memory = convert_bytes(memory)
    value, ending = memory.split(" ")

    # Add a chunk to deal with potential extra shared memory instances
    memory = math.ceil(float(value)) * 1.1
    memory = f"{memory} {ending}"
    device_map = infer_auto_device_map(
        model,
        max_memory={i: memory for i in range(num_processes)},
        no_split_module_classes=no_split_module_classes,
        clean_result=False,
    )
    return device_map
SunMarc1 year ago

Let's add a max_memory arg so that the user can set it themselves if they want. Then, if max_memory isn't provided and num_processes != 1, we would compute its value with the allocation above, and call infer_auto_device_map once at the end of the function.
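A sketch of how the proposed argument might be used, assuming it accepts the same per-device mapping as infer_auto_device_map's max_memory (the values here are illustrative):

# Hypothetical once the arg exists: cap each GPU manually instead of using the automatic balanced split
device_map = generate_device_map(
    model,
    num_processes=2,
    max_memory={0: "18GiB", 1: "18GiB"},
)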

src/accelerate/inference.py
    state = PartialState()
    example_args = send_to_device(example_args, "cpu")
    example_kwargs = send_to_device(example_kwargs, "cpu")
    if split_points == "auto":
        device_map = generate_device_map(model, state.num_processes, no_split_module_classes=no_split_module_classes)
        split_points = []
        for i in range(1, state.num_processes):
            split_points.append(next(k for k, v in device_map.items() if v == i))
SunMarc1 year ago

It would be great to have a sanity check to make sure that we indeed have num_processes split points, both when we generate the split_points and when the user passes them manually.
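A minimal sketch of such a check, assuming the convention of one split point per additional process (i.e. num_processes - 1 in total); `PartialState` and `split_points` are the names from the snippet above:

expected = PartialState().num_processes - 1
if len(split_points) != expected:
    raise ValueError(f"Expected {expected} split points but got {len(split_points)}")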

src/accelerate/inference.py
        return pippy_forward(stage.forward, *args, **kwargs)

    # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
    model_forward = MethodType(forward, model)
    forward.__wrapped__ = model_forward
SunMarc1 year ago👍 1

We can keep them for now but these lines create an infinite recursion loop with generate

muellerzr1 year ago

Made a note about this

src/accelerate/inference.py
    Wraps `model` for PipelineParallelism
    """
    state = PartialState()
    example_args = send_to_device(example_args, "cpu")
    example_kwargs = send_to_device(example_kwargs, "cpu")
SunMarc1 year ago👍 1

The API from pippy may change, but we should keep in mind that dealing with kwargs is a mess in pippy. Passed kwargs are expected, but that only applies to tensor args; the other args are "burned".

muellerzr1 year ago

Made a note in the docstring warning users. We can do so more thoroughly in the real documentation.

SunMarc
SunMarc commented on 2024-01-24
src/accelerate/inference.py
    # We need to annotate the split points in the model for PiPPy
    state = PartialState()
    annotate_split_points(model, {split_point: PipeSplitWrapper.SplitPoint.BEGINNING for split_point in split_points})
    pipe = Pipe.from_tracing(model, num_chunks=state.num_processes, example_args=args, example_kwargs=kwargs)
SunMarc1 year ago (edited 1 year ago)👍 1

We are hardcoding num_chunks to state.num_processes. This is definitely a good default since, as Ke noted, we would need at least num_processes chunks for the pipeline to have good overlap. However, we should maybe let the user change this value and test it themselves. I think this is a really important arg and the user should try to understand its effect since it impacts the performance of pippy.

muellerzr1 year ago

Documented
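Once the argument is exposed (it appears in the prepare_pippy signature later in this PR), tuning it could look like the following; the value 4 is just an example:

# Example only: cut each incoming batch into 4 microbatches instead of one per process
model = prepare_pippy(model, example_kwargs=example_inputs, num_chunks=4)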

muellerzr Allow for dynamic batch padding (#2352)
dac1daa9
muellerzr Rm literal
364c3b61
muellerzr Allow users to pass in max_memory
6a8479b2
muellerzr Note about recursion
303c9cc7
muellerzr Document, document, document
d497e8af
muellerzr Right import check
06bbc5b5
muellerzr Merge branch 'main' into pippy-integration-v2
5e047da8
muellerzr Fix bug, add tests to multigpu runners
a5059e62
muellerzr Change default to None
71346a19
kwen2501
kwen2501 approved these changes on 2024-01-31
kwen25011 year ago

Thanks a lot for the integration effort!
LGTM!

SunMarc
SunMarc approved these changes on 2024-01-31
SunMarc1 year ago

Thx for iterating! LGTM

src/accelerate/inference.py
def prepare_pippy(
    model, split_points="auto", no_split_module_classes=None, example_args=(), example_kwargs={}, num_chunks=None
):
SunMarc1 year ago👍 2

Learned from @fxmarty that it is dangerous to use a mutable dictionary as a default value: https://stackoverflow.com/questions/26320899/why-is-the-empty-dictionary-a-dangerous-default-value-in-python
Let's set it to example_kwargs: Optional[Dict[str, Any]] = None, just like in the from_tracing docstring.
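For reference, a sketch of the None-default idiom the suggestion points to (not the final signature):

def prepare_pippy(model, split_points="auto", example_args=(), example_kwargs=None, num_chunks=None):
    # A new dict is created per call, so callers can't accidentally share or mutate a default
    if example_kwargs is None:
        example_kwargs = {}
    ...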

src/accelerate/inference.py
    if found_batch_size is None:
        raise ValueError("Could not find batch size from args or kwargs")
    else:
        if found_batch_size != state.num_processes:
            args = pad_input_tensors(args, found_batch_size, state.num_processes)
            kwargs = pad_input_tensors(kwargs, found_batch_size, state.num_processes)
SunMarc1 year ago👍 1

I think we should replace state.num_processes with num_chunks.

tests/test_utils.py
    batch_size = 61
    batch = torch.rand(batch_size, 4, 4)
    result = pad_input_tensors(batch, batch_size, num_processes)
    # We should expect there to be 6 items now
SunMarc1 year ago
Suggested change
# We should expect there to be 6 items now
# We should expect there to be 66 items now
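For the record, the arithmetic behind the suggested "66" (this assumes num_processes is 6 in this test, which the excerpt does not show):

import math

num_processes = 6   # assumed, not visible in the excerpt above
batch_size = 61
padded = math.ceil(batch_size / num_processes) * num_processes
# padded == 66: the batch is padded up to the next multiple of num_processes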
src/accelerate/inference.py
            The expected inputs for the model that uses dictionary-based inputs. This is a *highly* limiting structure
            that requires the same keys be present at *all* inference calls. Not recommended unless the prior condition
            is true for all cases.
        num_chunks (`int`):
            The number of different stages the Pipeline will have. By default it will assign one chunk per GPU, but
            this can be tuned and played with. In general one should have num_chunks > num_gpus.
SunMarc1 year ago

Hi @kwen2501, could you help us better understand what this arg does? Is it the same arg as mentioned in this doc? Say we set num_chunks to 4: will it split the data of batch size n into 4 chunks of batch size n/4? And if we set num_chunks=1, we get the naive MP case. Is that right?

kwen25011 year ago❤ 2

That's right. Same as the definition in the doc you linked, i.e. how many microbatches we cut the input batch into.

Note that num_chunks is not necessarily the number of pipeline stages. For example, you can still have num_chunks=1 but 4 pipeline stages. That 1 chunk would pass through the 4 stages in series. But since there are no more data chunks, a stage (GPU) would become idle once the chunk passes through it.

kwen25011 year ago (edited 1 year ago)

That said, I agree with the comment that says:

In general one should have num_chunks > num_gpus

(maybe adding a "=" sign too)

SunMarc1 year ago

Thanks for clarifying, @kwen2501!
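To make the microbatching arithmetic described above concrete, a standalone illustration (not library code):

import torch

batch = torch.randn(8, 512)                            # full input batch
num_chunks = 4
microbatches = torch.chunk(batch, num_chunks, dim=0)   # 4 microbatches of shape (2, 512)
# Each microbatch flows through the stages in sequence, so an earlier stage can start
# on microbatch i+1 while a later stage is still working on microbatch i.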

muellerzr Start of docs
fe66b937
muellerzr Try again?
d2af472b
muellerzr Try again x2
8dc6c6c8
muellerzr Trailing comma
4d0aeb2b
muellerzr Move import
309b71a6
muellerzr Clean
9f561f11
muellerzr typehint
d5a6fda3
muellerzr typo
954a668e
kwen2501
kwen2501 approved these changes on 2024-02-05
kwen25011 year ago❤ 1

Thanks for writing the doc so quickly! Looks good to me!

docs/source/usage_guides/distributed_inference.md
- ## The Problem
+ 1. Loading an entire model onto each GPU and sending chunks of a batch through each model at a time
+ 2. Load parts of a model onto each GPU and process a single input at one time
+ 3. Load parts of a model onto each GPU and used what is called scheduled Pipeline Parallelism to combine the two prior techniques.
kwen25011 year ago

used --> use

docs/source/usage_guides/distributed_inference.md
This next part will discuss using *pipeline parallelism*. This is an **experimental** API utilizing the [PiPPy library by PyTorch](https://github.com/pytorch/PiPPy/) as a native solution.

The general idea with pipeline parallism is say you have 4 GPUs, and a model big enough it can be *split* on four GPUs using `device_map="auto"`. What this version will do is you can send in 4 inputs at at time (for example here, any amount works) and each models chunk will work on an ainput, then recieve the next input after the prior chunk finished it making it *much* more efficient **and faster** than the prior version. Here's a visual taken from the PyTorch repository:
kwen25011 year ago

each models chunk --> each model chunk
an ainput --> an input

docs/source/usage_guides/distributed_inference.md
From here all that's left is to actually perform the distributed inference!

<Tip warning={true}>
    When passing in inputs, while using `kwargs` are supported currently those are even *more* experimental, so it's highly recommended to just simply pass inptus in as a tuple of arguments.
kwen25011 year ago

inptus --> inputs

muellerzr From code review
853f5521
muellerzr Use num_chunks
1362e5c2
muellerzr Update tests/test_utils.py
68bd89b3
muellerzr marked this pull request as ready for review 1 year ago
kwen2501
kwen2501 commented on 2024-02-05
src/accelerate/inference.py
        raise ValueError("Could not find batch size from args or kwargs")
    else:
-       if found_batch_size != state.num_processes:
+       if found_batch_size != num_chunks:
            args = pad_input_tensors(args, found_batch_size, state.num_processes)
            kwargs = pad_input_tensors(kwargs, found_batch_size, state.num_processes)
kwen25011 year ago👍 1

Do we want to pad with num_chunks or state.num_processes?

muellerzr1 year ago

Indeed should be num_chunks all the way through, bad copy/paste caught me :)

muellerzr Bad copy/paste
181fbda9
muellerzr changed the title from "Pippy integration v2" to "torch-native pipeline parallelism for big models" 1 year ago
muellerzr requested a review from MKhalusova 1 year ago
muellerzr
muellerzr1 year ago

cc @MKhalusova for the docs!

muellerzr hf_split_points
9157cf16
MKhalusova
MKhalusova commented on 2024-02-06
MKhalusova1 year ago❤ 2

Nice work! I left a few comments to polish things in the docs a bit.

docs/source/package_reference/inference.md
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
MKhalusova1 year ago

If this is a new doc, you can put 2024 in the copyright

docs/source/usage_guides/distributed_inference.md
- Distributed inference is a common use case, especially with natural language processing (NLP) models. Users often want to
- send a number of different prompts, each to a different GPU, and then get the results back. This also has other cases
- outside of just NLP, however for this tutorial we will focus on just this idea of each GPU receiving a different prompt,
- and then returning the results.
+ Distributed inference can fall into three brackets:
- ## The Problem
+ 1. Loading an entire model onto each GPU and sending chunks of a batch through each model at a time
MKhalusova1 year ago👍 1

Suggestion: "Loading an entire model onto each GPU and sending chunks of a batch through each GPU's model copy at a time"

I think it's a bit clearer this way.

docs/source/usage_guides/distributed_inference.md
+ Distributed inference can fall into three brackets:
- ## The Problem
+ 1. Loading an entire model onto each GPU and sending chunks of a batch through each model at a time
+ 2. Load parts of a model onto each GPU and process a single input at one time
MKhalusova1 year ago
Suggested change
2. Load parts of a model onto each GPU and process a single input at one time
2. Loading parts of a model onto each GPU and processing a single input at one time
MKhalusova1 year ago

For consistency with the first item in the list, let's say "loading" instead of "load".

docs/source/usage_guides/distributed_inference.md
- ## The Problem
+ 1. Loading an entire model onto each GPU and sending chunks of a batch through each model at a time
+ 2. Load parts of a model onto each GPU and process a single input at one time
+ 3. Load parts of a model onto each GPU and use what is called scheduled Pipeline Parallelism to combine the two prior techniques.
MKhalusova1 year ago
Suggested change
3. Load parts of a model onto each GPU and use what is called scheduled Pipeline Parallelism to combine the two prior techniques.
3. Loading parts of a model onto each GPU and using what is called scheduled Pipeline Parallelism to combine the two prior techniques.
docs/source/usage_guides/distributed_inference.md
2. Load parts of a model onto each GPU and process a single input at one time
3. Load parts of a model onto each GPU and use what is called scheduled Pipeline Parallelism to combine the two prior techniques.

We're going to go through the first and the last, showcasing how to do each as they are more realistic scenarios.
MKhalusova1 year ago
Suggested change
We're going to go through the first and the last, showcasing how to do each as they are more realistic scenarios.
We're going to go through the first and the last bracket, showcasing how to do each as they are more realistic scenarios.
docs/source/usage_guides/distributed_inference.md
We're going to go through the first and the last, showcasing how to do each as they are more realistic scenarios.

## Sending chunks of inputs automatically to each loaded model
MKhalusova1 year ago
Suggested change
## Sending chunks of inputs automatically to each loaded model
## Sending chunks of a batch automatically to each loaded model
MKhalusova1 year ago

To be consistent with the first item in the list

docs/source/usage_guides/distributed_inference.md
## Sending chunks of inputs automatically to each loaded model

This is the most memory-intensive solution, as it requires each GPU keeps a full copy of the model in memory at a given time.
MKhalusova1 year ago
Suggested change
This is the most memory-intensive solution, as it requires each GPU keeps a full copy of the model in memory at a given time.
This is the most memory-intensive solution, as it requires each GPU to keep a full copy of the model in memory at a given time.
docs/source/usage_guides/distributed_inference.md
On the first GPU, the prompts will be `["a dog", "a cat"]`, and on the second GPU it will be `["a chicken", "a chicken"]`.
Make sure to drop the final sample, as it will be a duplicate of the previous one.

## A more memory-efficient version (experimental)
MKhalusova1 year ago

This is a vague title. Let's rename to something a bit more clear.
E.g. "Memory-efficient pipeline parallelism (experimental)"

docs/source/usage_guides/distributed_inference.md
This next part will discuss using *pipeline parallelism*. This is an **experimental** API utilizing the [PiPPy library by PyTorch](https://github.com/pytorch/PiPPy/) as a native solution.

The general idea with pipeline parallism is say you have 4 GPUs, and a model big enough it can be *split* on four GPUs using `device_map="auto"`. What this version will do is you can send in 4 inputs at at time (for example here, any amount works) and each model chunk will work on an input, then recieve the next input after the prior chunk finished it making it *much* more efficient **and faster** than the prior version. Here's a visual taken from the PyTorch repository:
MKhalusova1 year ago
Suggested change
The general idea with pipeline parallism is say you have 4 GPUs, and a model big enough it can be *split* on four GPUs using `device_map="auto"`. What this version will do is you can send in 4 inputs at at time (for example here, any amount works) and each model chunk will work on an input, then recieve the next input after the prior chunk finished it making it *much* more efficient **and faster** than the prior version. Here's a visual taken from the PyTorch repository:
The general idea with pipeline parallelism is: say you have 4 GPUs and a model big enough it can be *split* on four GPUs using `device_map="auto"`. With this method you can send in 4 inputs at a time (for example here, any amount works) and each model chunk will work on an input, then receive the next input once the prior chunk finished, making it *much* more efficient **and faster** than the method described earlier. Here's a visual taken from the PyTorch repository:
docs/source/usage_guides/distributed_inference.md
![PiPPy example](https://camo.githubusercontent.com/681d7f415d6142face9dd1b837bdb2e340e5e01a58c3a4b119dea6c0d99e2ce0/68747470733a2f2f692e696d6775722e636f6d2f657955633934372e706e67)

To use this with Accelerate, we have created a [model zoo](https://github.com/muellerzr/pippy-device-map-playground/) showcasing a number of different models and situations to do so. In this tutorial we'll take GPT2 however across two gpus.
MKhalusova1 year ago
Suggested change
To use this with Accelerate, we have created a [model zoo](https://github.com/muellerzr/pippy-device-map-playground/) showcasing a number of different models and situations to do so. In this tutorial we'll take GPT2 however across two gpus.
To illustrate how you can use this with Accelerate, we have created a [model zoo example](https://github.com/muellerzr/pippy-device-map-playground/) showcasing a number of different models and situations. In this tutorial, we'll show this method for GPT2 across two GPUs.
docs/source/usage_guides/distributed_inference.md
To use this with Accelerate, we have created a [model zoo](https://github.com/muellerzr/pippy-device-map-playground/) showcasing a number of different models and situations to do so. In this tutorial we'll take GPT2 however across two gpus.

Before anything, please make sure you have the latest pippy installed by performing:
MKhalusova1 year ago
Suggested change
Before anything, please make sure you have the latest pippy installed by performing:
Before you proceed, please make sure you have the latest pippy installed by running the following:
docs/source/usage_guides/distributed_inference.md
pip install torchpippy
```

We require at least version 0.2.0, please perform `pip show torchpippy` to check this!
MKhalusova1 year ago
Suggested change
We require at least version 0.2.0, please perform `pip show torchpippy` to check this!
We require at least version 0.2.0. To confirm that you have the correct version, run `pip show torchpippy`.
docs/source/usage_guides/distributed_inference.md
We require at least version 0.2.0, please perform `pip show torchpippy` to check this!

First we need to create the model on the CPU:
MKhalusova1 year ago
Suggested change
First we need to create the model on the CPU:
Start by creating the model on the CPU:
docs/source/usage_guides/distributed_inference.md
model.eval()
```

Next we need to create some example inputs to use. These help PiPPy trace the model.
MKhalusova1 year ago
Suggested change
Next we need to create some example inputs to use. These help PiPPy trace the model.
Next you'll need to create some example inputs to use. These help PiPPy trace the model.
docs/source/usage_guides/distributed_inference.md
<Tip warning={true}>
    However you make this example will determine the relative batch size that will be used/passed
    through the model at a given time, so make sure to remember them!
MKhalusova1 year ago

I'm not sure I understand what "them" refers to in "make sure to remember them"

docs/source/usage_guides/distributed_inference.md
    requires_grad=False,
)
```

Next we need to actually perform the tracing and get the model ready. To do so you simply use the [`inference.prepare_pippy`] function and it will fully wrap the model for pipeline parallism automatically:
MKhalusova1 year ago
Suggested change
Next we need to actually perform the tracing and get the model ready. To do so you simply use the [`inference.prepare_pippy`] function and it will fully wrap the model for pipeline parallism automatically:
Next we need to actually perform the tracing and get the model ready. To do so, use the [`inference.prepare_pippy`] function and it will fully wrap the model for pipeline parallelism automatically:
docs/source/usage_guides/distributed_inference.md
<Tip>
    There are a variety of parameters you can pass through to `prepare_pippy`:

* `split_points` will let you determine where to split the model at. By default we use wherever `device_map="auto" declares
MKhalusova1 year ago
Suggested change
* `split_points` will let you determine where to split the model at. By default we use wherever `device_map="auto" declares
* `split_points` lets you determine where to split the model. By default, we use wherever `device_map="auto" declares
MKhalusova1 year ago

Declares what? Maybe give an example?

muellerzr1 year ago

Done!

docs/source/usage_guides/distributed_inference.md
model = prepare_pippy(model, example_args=(input,))
```

<Tip>
MKhalusova1 year ago
Suggested change
<Tip>
<Tip>
MKhalusova1 year ago

Added a newline, so that the content inside the tip would render properly

docs/source/usage_guides/distributed_inference.md
<Tip>
    There are a variety of parameters you can pass through to `prepare_pippy`:
    * `split_points` will let you determine where to split the model at. By default we use wherever `device_map="auto" declares
* `num_chunks` can be used to determine how the batch will be split and sent to the model itself (so `num_chunks=1` with four split points/four GPUs would have a naive MP where a single input gets passed between the four layer split points)
MKhalusova1 year ago
Suggested change
* `num_chunks` can be used to determine how the batch will be split and sent to the model itself (so `num_chunks=1` with four split points/four GPUs would have a naive MP where a single input gets passed between the four layer split points)
* `num_chunks` determines how the batch will be split and sent to the model itself (so `num_chunks=1` with four split points/four GPUs will have a naive MP where a single input gets passed between the four layer split points)
docs/source/usage_guides/distributed_inference.md
* `num_chunks` can be used to determine how the batch will be split and sent to the model itself (so `num_chunks=1` with four split points/four GPUs would have a naive MP where a single input gets passed between the four layer split points)
</Tip>

From here all that's left is to actually perform the distributed inference!
MKhalusova1 year ago
Suggested change
From here all that's left is to actually perform the distributed inference!
From here, all that's left is to actually perform the distributed inference!
docs/source/usage_guides/distributed_inference.md
From here all that's left is to actually perform the distributed inference!

<Tip warning={true}>
MKhalusova1 year ago
Suggested change
<Tip warning={true}>
<Tip warning={true}>
docs/source/usage_guides/distributed_inference.md
From here all that's left is to actually perform the distributed inference!

<Tip warning={true}>
    When passing in inputs, while using `kwargs` are supported currently those are even *more* experimental, so it's highly recommended to just simply pass inputs in as a tuple of arguments.
MKhalusova1 year ago
Suggested change
When passing in inputs, while using `kwargs` are supported currently those are even *more* experimental, so it's highly recommended to just simply pass inputs in as a tuple of arguments.
When passing inputs, we highly recommend to pass them in as a tuple of arguments. Using `kwargs` is supported, however, this approach is experimental.
docs/source/usage_guides/distributed_inference.md
    output = model(*args)
```

Afterwards all the data will be on the last GPU, which you can use the [`PartialState`] to find and extract:
MKhalusova1 year ago
Suggested change
Afterwards all the data will be on the last GPU, which you can use the [`PartialState`] to find and extract:
When finished, all the data will be on the last GPU, which you can use the [`PartialState`] to find and extract:
muellerzr Apply suggestions from code review
f2c6e088
muellerzr Year
9f204969
muellerzr Nit
e1961d6d
muellerzr better title
8c72a5e6
muellerzr Rephrase
3eaa9678
muellerzr Rephrase
31fcde4d
muellerzr
muellerzr1 year ago👍 1

Final comment before merging, things that still need to be done in a later PR at some point (but it's okay for them not to be in the first iteration of this joint effort):

  1. Specify balanced_pippy device map and allow a sequential device_map when making the pipeline via prepare_pippy
  2. Look into supporting model.generate() through an alternative hook into the model forward if possible
  3. Make sure all outputs end up on the CPU so users don't need to check at the end and we can call them via a .gather
  4. Migrate the pippy-device-map-playground examples over to here as part of our examples folder

(I'll be doing 3 & 4 this week as a follow-up prior to release)

muellerzr Try spacing maybe?
7c3d1830
muellerzr merged 0867c093 into main 1 year ago
muellerzr deleted the pippy-integration-v2 branch 1 year ago
