C4-mC4 pre processing #9

sbmaruf wants to merge 42 commits into bigscience-workshop:main from sbmaruf:c4-mc4-pre_processing
sbmaruf
sbmaruf caching c4-mc4
06a51b1e
sbmaruf calc lang sampling ratio
5023ee4b
sbmaruf changed the title from "C4 mc4 pre processing" to "C4-mC4 pre processing" 4 years ago
sbmaruf cleaning
fa2874fd
sbmaruf typo in help
ed565fcc
sbmaruf
sbmaruf
sbmaruf improve logging
baf745b9
sbmaruf export hfdataset to jsonl data
66c32a96
sbmaruf preprocess data
fc4ad61a
stas00
sbmaruf
stas00
huu4ontocord
sbmaruf Merge branch 'bigscience-workshop:main' into c4-mc4-pre_processing
be23a5d6
sbmaruf moving files
2d8a5bda
sbmaruf run scripts for pre-processing
025c8e7c
sbmaruf cache hf-dataset
df754517
sbmaruf data sampling prob calc
2c7b8ca6
sbmaruf extract data from hf dataset
34e0873a
sbmaruf extract data from allenai git lfs
3ad74e39
sbmaruf data process
c293580a
sbmaruf fix, helper scripts
8f0d4163
sbmaruf cleaning, moving scripts
47244d0d
sbmaruf
stas00
huu4ontocord
sbmaruf sync
a5fae351
sbmaruf update README
a626af85
sbmaruf add iterator prob
f1ed1a8f
sbmaruf add comment
2adad9af
sbmaruf add comment
4316f4b9
sbmaruf script update
67938960
sbmaruf update readme
5b220e54
sbmaruf update readme
fc1b4d03
sbmaruf sample iterator selection probability output
89c7dc00
sbmaruf print alpha pyfiglet
5815e5f7
sbmaruf
sbmaruf requested a review from ibeltagy 4 years ago
sbmaruf requested a review from stas00 4 years ago
sbmaruf typos
f48f5b26
sbmaruf output of iterator selection probs
bf03ae65
sbmaruf per shard prob added
6e3d285f
sbmaruf requested a review from yongzx 4 years ago
sbmaruf typo
5f8d7d63
stas00 requested changes on 2021-08-10
sbmaruf remove small log
9b37746f
sbmaruf merge
3d2f7364
sbmaruf update caching description.
828bdcb8
sbmaruf consistently naming of data size.
8383432b
sbmaruf typo
9a588e8b
stas00 commented on 2021-08-11
stas00 commented on 2021-08-11
sbmaruf Data size formatting: KB, MB, GB, TB
1d05287f
sbmaruf Update Readme
ffb3924e
yongzx
sbmaruf
thomasw21 commented on 2021-08-11
yongzx commented on 2021-08-12
yongzx commented on 2021-08-12
sbmaruf update comment and run script
b676d123
sbmaruf typo
5c146748
sbmaruf add small doc-string for each of the modules.
197c83a0
sbmaruf cleaning, recovering.
04b47ed0

