caching c4-mc4
06a51b1e
calc lang sampling ratio
5023ee4b
sbmaruf
changed the title from "C4 mc4 pre processing" to "C4-mC4 pre processing" 4 years ago
cleaning
fa2874fd
typo in help
ed565fcc
improve logging
baf745b9
export hfdataset to jsonl data
66c32a96
preprocess data
fc4ad61a
Merge branch 'bigscience-workshop:main' into c4-mc4-pre_processing
be23a5d6
moving files
2d8a5bda
run scripts for pre-processing
025c8e7c
cache hf-dataset
df754517
data sampling prob calc
2c7b8ca6
extract data from hf dataset
34e0873a
extract data from allenai git lfs
3ad74e39
data process
c293580a
fix, helper scripts
8f0d4163
cleaning, moving scripts
47244d0d
sync
a5fae351
update README
a626af85
add iterator prob
f1ed1a8f
add comment
2adad9af
add comment
4316f4b9
script update
67938960
update readme
5b220e54
update readme
fc1b4d03
sample iterator selection probability output
89c7dc00
print alpha pyfiglet
5815e5f7
typos
f48f5b26
output of iterator selection probs
bf03ae65
per shard prob added
6e3d285f
typo
5f8d7d63
stas00
requested changes
on 2021-08-10
remove small log
9b37746f
merge
3d2f7364
update caching description.
828bdcb8
consistently naming of data size.
8383432b
typo
9a588e8b
stas00
commented
on 2021-08-11
stas00
commented
on 2021-08-11
Data size formatting: KB, MB, GB, TB
1d05287f
Update Readme
ffb3924e
yongzx
commented
on 2021-08-12
yongzx
commented
on 2021-08-12
update comment and run script
b676d123
typo
5c146748
add small doc-string for each of the modules.
197c83a0
cleaning, recovering.
04b47ed0