Exciting updates to the Wikipedia Monthly dataset for November! 🚀
・ Fixed a bug to remove infobox leftovers and other wiki markers such as __TOC__
・ New Python package, wikisets (https://pypi.org/project/wikisets): a dataset builder with efficient sampling, so you can seamlessly combine the languages you want for any date (ideal for pretraining data, but works for any purpose)
・ Moved the pipeline to a large server. Costs are much higher, but reliability and predictability are better (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing for this month due to shenanigans with the migration, but should be back in December's update.
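For anyone curious what the marker cleanup in the first bullet might look like, here is a minimal sketch. It is not the pipeline's actual code: the function name `strip_behavior_switches` and the list of markers are assumptions for illustration, using a handful of MediaWiki behavior switches of the same kind as `__TOC__`.

```python
import re

# Illustrative list of MediaWiki behavior switches (an assumption, not the
# pipeline's real rule set). An explicit list avoids eating legitimate
# double-underscore tokens such as __FILE__ in programming articles.
BEHAVIOR_SWITCH = re.compile(
    r"__(?:TOC|NOTOC|FORCETOC|NOEDITSECTION|NOGALLERY|INDEX|NOINDEX)__"
)


def strip_behavior_switches(text: str) -> str:
    """Remove behavior-switch markers and tidy the whitespace they leave."""
    cleaned = BEHAVIOR_SWITCH.sub("", text)
    # Collapse runs of spaces or tabs left where a marker was removed.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    # Drop trailing whitespace at the end of each line.
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)
    return cleaned.strip()
```

The explicit list is a deliberate trade-off: a broad pattern like `__[A-Z]+__` would catch more markers but could also delete real article content.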
Check out the dataset:
omarkamali/wikipedia-monthly
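To give a feel for what "combining languages with sampling" means, here is a rough pure-Python sketch of weighted mixing across per-language sources. This is not the wikisets API (check its PyPI page for the real interface); `mix_languages`, its parameters, and the row layout are all hypothetical names made up for this example.

```python
import random
from typing import Dict, Iterable, List


def mix_languages(
    sources: Dict[str, Iterable[dict]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[dict]:
    """Draw up to n rows, picking the source language by weight at each step."""
    rng = random.Random(seed)  # seeded so the mix is reproducible
    iterators = {lang: iter(rows) for lang, rows in sources.items()}
    mixed: List[dict] = []
    while iterators and len(mixed) < n:
        langs = list(iterators)
        # Weights must be positive; only still-active languages are considered.
        lang = rng.choices(langs, weights=[weights[l] for l in langs], k=1)[0]
        try:
            mixed.append(next(iterators[lang]))
        except StopIteration:
            # This language ran out of rows: stop sampling from it.
            del iterators[lang]
    return mixed


if __name__ == "__main__":
    demo = {
        "en": [{"lang": "en", "text": f"article {i}"} for i in range(100)],
        "ar": [{"lang": "ar", "text": f"article {i}"} for i in range(100)],
    }
    sample = mix_languages(demo, {"en": 0.7, "ar": 0.3}, n=10)
    print([row["lang"] for row in sample])
```

With the real data, each source could be a streaming split obtained through `datasets.load_dataset("omarkamali/wikipedia-monthly", <config>, split="train", streaming=True)`, where the config name for a given date and language should be taken from the dataset card rather than guessed. The `datasets` library also ships `interleave_datasets` with `probabilities` and `seed` arguments, which covers a similar use case.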