Mahout precomputed item-item similarity - slow recommendation

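I am having performance issues with precomputed item-item similarities in Mahout.

I have about 4 million users and almost the same number of items, with around 100M user-item preferences. I want to do content-based recommendation based on the cosine similarity of the TF-IDF vectors of the documents.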

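Since computing this on the fly is slow, I precomputed the pairwise similarities of the top 50 most similar documents as follows:

1. I used seq2sparse to produce the TF-IDF vectors.
2. I used mahout rowId to produce the Mahout matrix.
3. I used mahout rowSimilarity -i INPUT/matrix -o OUTPUT -r 4587604 --similarityClassname SIMILARITY_COSINE -m 50 -ess to produce the top 50 most similar documents.

I used Hadoop to precompute all of this. For 4 million items, the output was only 2.5GB.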

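Then I loaded the contents of the files generated by the reducers into Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ..., using the docIndex to decode the IDs of the documents. The IDs were already integers, but rowId had renumbered them starting from 1, so I had to map them back.

For recommendation I use the following code: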

ItemSimilarity similarity = new GenericItemSimilarity(corrMatrix); // corrMatrix: the collection loaded above

CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems()); 
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems()); 

Recommender recommender = new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);

I am trying this with a limited data model (1.6M items), but I loaded all the item-item pairwise similarities into memory. I managed to load everything into main memory using 40GB.

When I want to produce recommendations for one user:

Recommender cachingRecommender = new CachingRecommender(recommender); 
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);

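The elapsed time for the recommendation process is 554.938583083 seconds, and on top of that it did not produce any recommendations. Right now I am really concerned about the performance of the recommendation. I played with the parameters of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but I did not get any improvement in performance.

Isn't the whole idea of precomputing everything to speed up the recommendation process? Could someone please help me and tell me where I am going wrong and what I am doing wrong? Also, why does loading the pairwise similarities into main memory blow up so much? 2.5GB of files was loaded into 40GB of main memory in the Collection<GenericItemSimilarity.ItemItemSimilarity> matrix. I know that the files are serialized as IntWritable, VectorWritable key-value pairs, and that the key has to be repeated for every vector value in the ItemItemSimilarity matrix, but this is a little too much, don't you think?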

Thank you in advance.

Source

2013-09-03 Dragan Milcevski

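I stand corrected about the time needed to compute the recommendation using a Collection for the precomputed values. Apparently I had put long startTime = System.nanoTime(); at the top of my code, not right before List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);. So the measurement included the time needed to load the data set and the precomputed item-item similarities into main memory.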
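The measurement mistake described above can be avoided by timing the loading step and the recommendation call separately. The following is only a minimal sketch using the standard library: loadEverything and recommendOnce are hypothetical stand-ins (they just sleep) for loading the model and similarities and for the cachingRecommender.recommend(userID, howMany) call.

```java
import java.util.concurrent.TimeUnit;

class TimingSketch {

    /** Runs the task once and returns the wall-clock time it took, in seconds. */
    static double elapsedSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1e9;
    }

    // Hypothetical stand-in for loading the data model and the similarities.
    static void loadEverything() {
        sleep(200);
    }

    // Hypothetical stand-in for cachingRecommender.recommend(userID, howMany).
    static void recommendOnce() {
        sleep(20);
    }

    private static void sleep(long millis) {
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Each phase gets its own start time, so loading cost is not
        // attributed to the recommendation call.
        double loadTime = elapsedSeconds(TimingSketch::loadEverything);
        double recommendTime = elapsedSeconds(TimingSketch::recommendOnce);
        System.out.printf("load: %.3f s, recommend: %.3f s%n", loadTime, recommendTime);
    }
}
```

Reporting the two numbers separately makes it obvious whether a slow run is dominated by loading or by the recommender itself.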

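However, I stand by what I said about the memory consumption. I improved it by using a custom ItemSimilarity and loading the precomputed similarities into a HashMap<Long, HashMap<Long, Double>>. I used the Trove library to reduce the space requirements.

Here is the detailed code for the custom ItemSimilarity: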

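import java.util.Collection;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

// Trove 3.x package names; in Trove 2.x both classes live directly in gnu.trove.
import gnu.trove.map.hash.TLongDoubleHashMap;
import gnu.trove.map.hash.TLongObjectHashMap;

public class TextItemSimilarity implements ItemSimilarity {

    // itemID -> (similar itemID -> precomputed similarity)
    private final TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix;

    public TextItemSimilarity(TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix) {
        this.correlationMatrix = correlationMatrix;
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
        // Nothing to refresh: the similarities are precomputed.
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        TLongDoubleHashMap similarToItemId1 = correlationMatrix.get(itemID1);
        if (similarToItemId1 != null && !similarToItemId1.isEmpty() && similarToItemId1.containsKey(itemID2)) {
            return similarToItemId1.get(itemID2);
        }
        return 0; // the pair was not precomputed
    }

    @Override
    public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
            result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
    }

    @Override
    public long[] allSimilarItemIDs(long itemID) throws TasteException {
        TLongDoubleHashMap similar = correlationMatrix.get(itemID);
        // Guard against items that have no precomputed neighbours.
        return similar == null ? new long[0] : similar.keys();
    }
}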
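To try out the lookup logic of the class above without Mahout or Trove on the classpath, here is a minimal sketch that swaps in plain java.util.HashMap for the Trove maps. The class name NestedSimilarityLookup, the put helper and the sample values are hypothetical and only serve as illustration.

```java
import java.util.HashMap;
import java.util.Map;

class NestedSimilarityLookup {

    // itemID -> (similar itemID -> similarity)
    private final Map<Long, Map<Long, Double>> matrix = new HashMap<>();

    /** Stores one precomputed similarity in the direction itemID1 -> itemID2. */
    void put(long itemID1, long itemID2, double value) {
        matrix.computeIfAbsent(itemID1, k -> new HashMap<>()).put(itemID2, value);
    }

    /** Returns the stored similarity, or 0 when the pair was not precomputed. */
    double itemSimilarity(long itemID1, long itemID2) {
        Map<Long, Double> row = matrix.get(itemID1);
        if (row == null) {
            return 0.0;
        }
        return row.getOrDefault(itemID2, 0.0);
    }

    /** Returns the IDs similar to itemID, or an empty array for unknown items. */
    long[] allSimilarItemIDs(long itemID) {
        Map<Long, Double> row = matrix.get(itemID);
        if (row == null) {
            return new long[0];
        }
        return row.keySet().stream().mapToLong(Long::longValue).toArray();
    }

    public static void main(String[] args) {
        NestedSimilarityLookup sim = new NestedSimilarityLookup();
        sim.put(1L, 2L, 0.8); // made-up sample values
        sim.put(1L, 3L, 0.4);
        System.out.println(sim.itemSimilarity(1L, 2L));
        System.out.println(sim.itemSimilarity(2L, 1L));
        System.out.println(sim.allSimilarItemIDs(99L).length);
    }
}
```

Note that the lookup is directional: a pair stored as (1, 2) is not found as (2, 1) unless it is loaded in both directions. The boxed Long and Double values in this version cost extra memory per entry, which is the overhead the primitive Trove maps are meant to avoid.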

With my data set, the total memory consumption is 30GB using Collection<GenericItemSimilarity.ItemItemSimilarity>, and 17GB when using TLongObjectHashMap<TLongDoubleHashMap> with the custom TextItemSimilarity. The time per recommendation is 0.05 seconds using Collection<GenericItemSimilarity.ItemItemSimilarity> and 0.07 seconds using TLongObjectHashMap<TLongDoubleHashMap>. I also believe that CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy play a big role in the performance.

So I guess that if you want to save some space you should use the Trove hash maps, and if you want just slightly better performance you can use Collection<GenericItemSimilarity.ItemItemSimilarity>.
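As a rough illustration of why an object-per-pair representation needs far more memory than the 2.5GB on disk, the sketch below computes a lower bound for holding every pair as its own Java object with two long IDs and one double. The layout constants (12-byte object header, 8-byte alignment, 4-byte references) are assumptions for a 64-bit HotSpot JVM with compressed references, not measured values, and the class and method names are hypothetical. The row count and per-row limit are the -r and -m values from the rowSimilarity call in the question.

```java
class SimilarityMemoryEstimate {

    // Assumed JVM layout constants (64-bit HotSpot, compressed references).
    static final long OBJECT_HEADER_BYTES = 12;
    static final long REFERENCE_BYTES = 4;
    static final long ALIGNMENT_BYTES = 8;

    /** Rounds size up to the next multiple of alignment. */
    static long alignUp(long size, long alignment) {
        return (size + alignment - 1) / alignment * alignment;
    }

    /** Upper bound on stored pairs: each row keeps at most maxPerRow neighbours. */
    static long maxPairs(long rows, long maxPerRow) {
        return rows * maxPerRow;
    }

    /** Bytes for one object with two long IDs and one double, plus one reference to it. */
    static long bytesPerPairObject() {
        long fields = 2 * Long.BYTES + Double.BYTES;
        return alignUp(OBJECT_HEADER_BYTES + fields, ALIGNMENT_BYTES) + REFERENCE_BYTES;
    }

    public static void main(String[] args) {
        long pairs = maxPairs(4_587_604L, 50); // -r and -m from the rowSimilarity call
        double gib = pairs * bytesPerPairObject() / (1024.0 * 1024 * 1024);
        System.out.printf("at most %d pairs, about %.1f GiB as one object per pair%n", pairs, gib);
    }
}
```

This only counts the pair objects themselves. Any index built on top of them, such as nested maps with boxed Double values, adds to the total, so the real footprint can be considerably higher than this estimate.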

Source

2013-09-06 13:46:47
