Beginners can refer directly to:
Submitting Applications
Reference:
Reference (on the difference between the two):
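If you need RDDs nested inside RDDs, or the algorithm you want exists only in Python's own sklearn, consider grouping the samples and distributing the groups. Model training for each group still runs on a single machine, so this approach assumes that each group's data, once split, does not take up much memory when trained on one machine.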
Say you find yourself in the peculiar situation where you need to train a whole bunch of scikit-learn models over different groups from a large amount of data. And say you want to leverage Spark to distribute the process to do it all in a scalable fashion.
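As a rough illustration of the grouping idea, here is a minimal sketch, not taken from the referenced article. It assumes the data is already an RDD of `(group_key, (feature_vector, target))` pairs and that scikit-learn is installed on every executor. The names `fit_group`, `train_models_per_group`, and `run_example`, and the choice of `LinearRegression`, are hypothetical and used only for illustration.

```python
from sklearn.linear_model import LinearRegression


def fit_group(rows):
    """Fit one scikit-learn model on the rows of a single group.

    rows: iterable of (feature_vector, target) pairs. This runs on one
    executor, so the whole group must fit in that executor's memory.
    """
    rows = list(rows)
    X = [list(features) for features, _ in rows]
    y = [target for _, target in rows]
    # LinearRegression is an arbitrary stand-in for any sklearn estimator.
    return LinearRegression().fit(X, y)


def train_models_per_group(keyed_rdd):
    """keyed_rdd: RDD of (group_key, (feature_vector, target)) pairs.

    Returns a dict mapping each group key to its fitted model.
    """
    # groupByKey gathers all rows of a group onto one executor;
    # mapValues then trains that group's model there, in parallel
    # with the other groups.
    return keyed_rdd.groupByKey().mapValues(fit_group).collectAsMap()


def run_example():
    # Imported lazily so the functions above can be used without pyspark.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # Toy data: group "a" follows y = 2x + 1, group "b" follows y = -x.
    data = [
        ("a", ([0.0], 1.0)), ("a", ([1.0], 3.0)), ("a", ([2.0], 5.0)),
        ("b", ([0.0], 0.0)), ("b", ([1.0], -1.0)), ("b", ([2.0], -2.0)),
    ]
    models = train_models_per_group(sc.parallelize(data))
    for key, model in sorted(models.items()):
        print(key, model.coef_, model.intercept_)
```

Calling `run_example()` on a machine with pyspark installed trains one model per group. Note that `groupByKey` shuffles every row of a group to a single executor, which is exactly why the approach only works when each group is small enough for single-machine training.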