Getting started with pyspark | Submitting pyspark jobs with spark-submit


  Beginners can start directly from the official guide:

  Submitting Applications
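  For instance, a minimal spark-submit invocation for a pyspark script might look like the following; the script name, cluster manager, resource settings, and dependency archive below are placeholders to adapt to your own environment, not values from the original article:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-memory 4g \
      --py-files deps.zip \
      my_job.py arg1 arg2

  Here --py-files ships extra Python modules to the executors, and --master / --deploy-mode decide where the driver and executors run.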

  Reference:

  Reference:

  The difference between the two:

  If you need to nest one RDD inside another, or the algorithm you need exists only in Python's scikit-learn, consider grouping the samples and distributing the groups (the model training itself is still single-machine, so this approach assumes that each group's data is small enough to be trained within one machine's memory).

  Say you find yourself in the peculiar situation where you need to train a whole bunch of scikit-learn models over different groups from a large amount of data. And say you want to leverage Spark to distribute the process to do it all in a scalable fashion.
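  A minimal sketch of this pattern, assuming Spark 3.x with pyarrow installed so that groupBy().applyInPandas() is available; the toy data, column names, model choice, and output schema are illustrative assumptions, not taken from the original article:

    import pandas as pd
    from pyspark.sql import SparkSession
    from sklearn.linear_model import LinearRegression

    spark = SparkSession.builder.appName("per-group-sklearn").getOrCreate()

    # Toy data: several small groups inside one large distributed DataFrame.
    df = spark.createDataFrame(
        [("a", 1.0, 2.0), ("a", 2.0, 4.1), ("a", 3.0, 6.2),
         ("b", 1.0, 3.0), ("b", 2.0, 5.9), ("b", 3.0, 9.1)],
        ["group", "x", "y"],
    )

    def train_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # Each group arrives as a single pandas DataFrame on one executor,
        # so the whole group must fit in that executor's memory.
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "coef": [float(model.coef_[0])]})

    # Spark distributes the groups; scikit-learn trains one model per group.
    result = df.groupBy("group").applyInPandas(
        train_group, schema="group string, coef double")
    result.show()

  To run this on a cluster, package the script and its dependencies and submit it with spark-submit as shown above.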