☆ Yσɠƚԋσʂ ☆ · 8 months ago

Couldn't have happened to a nicer guy

☆ Yσɠƚԋσʂ ☆ · 8 months ago

What part of OSI are you claiming DeepSeek doesn’t satisfy specifically?

The Octonaut · edit-2 8 months ago

The data part, i.e. the very first part of the OSI’s definition.

It’s not available from their articles https://arxiv.org/html/2501.12948v1 https://arxiv.org/html/2401.02954v1

Nor on their github https://github.com/deepseek-ai/DeepSeek-LLM

Note that the OSI only asks for transparency about what the dataset was - a name and the fee paid will do - not for full access to it to be free and Free.

It’s worth mentioning too that they’ve used the MIT license for the “code” included with the model (a few YAML files to feed it to software) but they have created their own unrecognised non-free license for the model itself. Why they have this misleading label on their GitHub page would only be speculation.

Without the dataset being made available, nobody can accurately recreate, modify or learn from the model they’ve released. This is the only sane definition of open source available for an LLM, since the model is not in itself code with a “source”.

☆ Yσɠƚԋσʂ ☆ · 8 months ago

Uh yeah, that’s because people publish data to huggingface. GitHub isn’t made for huge data files in case you weren’t aware. You can scroll down to datasets here https://huggingface.co/deepseek-ai

The Octonaut · 8 months ago

That’s the “prover” dataset, i.e. the evaluation dataset mentioned in the articles I linked you to. It’s for checking the output; it is not the training data.

It’s also 20 MB, which is minuscule not just for a training dataset but even by the standard of what you seem to think is a “huge data file” in general.

You really need to stop digging and admit this is one more thing you have surface-level understanding of.

☆ Yσɠƚԋσʂ ☆ · 8 months ago

Do show me a published data set of the kind you’re demanding.

The Octonaut · edit-2 8 months ago

Since you’re definitely asking this in good faith and not just downvoting and making nonsense sealion requests in an attempt to make me shut up, sure! Here’s three.

https://commoncrawl.org/

https://github.com/togethercomputer/RedPajama-Data

https://huggingface.co/datasets/legacy-datasets/wikipedia/tree/main/

Oh, and it’s not me demanding. It’s the OSI defining what an open source AI model is. I’m sure once you’ve asked all your questions you’ll circle back around to whether you disagree with their definition or not.

@HappyTimeHarry@lemm.ee · 8 months ago

Thank you for posting those links. While I’m not sure the person you replied to was asking in good faith, I myself wanted to see an example after reading the discussion.

Seems like even if it’s not fully open source, it’s a step in the right direction in a world where terms like “open” and “non-profit” have been co-opted by corporations and have lost their original meaning.

☆ Yσɠƚԋσʂ ☆ · edit-2 8 months ago

So you found a legacy data set that was released nearly a year ago as your best example. Thanks for proving my point. And since you obviously know what you’re talking about, do explain to the class what stops people from using these data sets to train a DeepSeek model?

The Octonaut · 8 months ago

The most recent crawl is from December 15th

https://commoncrawl.org/blog/december-2024-crawl-archive-now-available

You don’t know, and can’t know, when DeepSeek’s dataset is from. Thanks for proving my point.

☆ Yσɠƚԋσʂ ☆ · 8 months ago

What I do know is that you can take the DeepSeek model and train it on this open crawl to get a fully open model. I love how you ignored this part in your reply, being the clown that you are.