Python: improving a long cumulative sum

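I have a program that operates on a large set of experimental data. The data is stored as a list of objects that are instances of a class with the following attributes:

time_point - the time of the sample
cluster - the name of the cluster of nodes from which the sample was taken
node - the name of the node from which the sample was taken
qty1 - the value of the sample for the first quantity
qty2 - the value of the sample for the second quantity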
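For readers who want something concrete to run the later snippets against, here is a minimal sketch of what such records might look like. The post never shows the class definitions, so `Sample`, the `Peak` field layout, and the sample values are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record type: the post only lists the attribute names.
Sample = namedtuple("Sample", ["time_point", "cluster", "node", "qty1", "qty2"])

# Hypothetical Peak record: a time point plus the cumulative quantities there.
Peak = namedtuple("Peak", ["time_point", "qty1", "qty2"])

# Made-up data, deliberately not in time order.
dataset = [
    Sample(time_point=2, cluster="c1", node="n1", qty1=-1, qty2=4),
    Sample(time_point=1, cluster="c1", node="n2", qty1=3, qty2=0),
    Sample(time_point=3, cluster="c2", node="n3", qty1=2, qty2=-5),
]
```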

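I need to derive some values from the data set, grouped in three ways: once for the sample as a whole, once for each cluster of nodes, and once for each node. The values I need to derive depend on the (time-sorted) cumulative sums of qty1 and qty2: the maximum value of the element-wise sum of the cumulative sums of qty1 and qty2, the time point at which that maximum occurred, and the values of qty1 and qty2 at that time point.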
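To make the derived values concrete, here is a small worked example for a single group. The four samples are invented, and `itertools.accumulate` is used only for brevity; it is not part of the original post.

```python
from itertools import accumulate

# Hypothetical, already time-sorted samples: (time_point, qty1, qty2).
samples = [(1, 3, 0), (2, -1, 4), (3, 2, -5), (4, -4, 1)]

times = [s[0] for s in samples]
cum_qty1 = list(accumulate(s[1] for s in samples))   # [3, 2, 4, 0]
cum_qty2 = list(accumulate(s[2] for s in samples))   # [0, 4, -1, 0]
combo = [a + b for a, b in zip(cum_qty1, cum_qty2)]  # [3, 6, 3, 0]

# list.index returns the first position of the maximum.
peak_index = combo.index(max(combo))
peak = (times[peak_index], cum_qty1[peak_index], cum_qty2[peak_index])
print(peak)  # (2, 2, 4)
```

The peak is the time point where the combined running total is largest (6, at time 2), together with the two running totals at that moment.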

I came up with the following solution:

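import operator
from collections import defaultdict

# Sort in place so the running totals below are accumulated in time order
dataset.sort(key=operator.attrgetter('time_point'))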

# For the whole set 
sys_qty1 = 0 
sys_qty2 = 0 
sys_combo = 0 
sys_max = 0 

# For the cluster grouping 
cluster_qty1 = defaultdict(int) 
cluster_qty2 = defaultdict(int) 
cluster_combo = defaultdict(int) 
cluster_max = defaultdict(int) 
cluster_peak = defaultdict(int) 

# For the node grouping 
node_qty1 = defaultdict(int) 
node_qty2 = defaultdict(int) 
node_combo = defaultdict(int) 
node_max = defaultdict(int) 
node_peak = defaultdict(int) 

for t in dataset: 
    # For the whole system ###################################################### 
    sys_qty1 += t.qty1 
    sys_qty2 += t.qty2 
    sys_combo = sys_qty1 + sys_qty2 
    if sys_combo > sys_max:
        sys_max = sys_combo
        # The Peak class is to record the time point and the cumulative quantities
        system_peak = Peak(time_point=t.time_point,
                           qty1=sys_qty1,
                           qty2=sys_qty2)
    # For the cluster grouping ################################################## 
    cluster_qty1[t.cluster] += t.qty1 
    cluster_qty2[t.cluster] += t.qty2 
    cluster_combo[t.cluster] = cluster_qty1[t.cluster] + cluster_qty2[t.cluster] 
    if cluster_combo[t.cluster] > cluster_max[t.cluster]:
        cluster_max[t.cluster] = cluster_combo[t.cluster]
        cluster_peak[t.cluster] = Peak(time_point=t.time_point,
                                       qty1=cluster_qty1[t.cluster],
                                       qty2=cluster_qty2[t.cluster])
    # For the node grouping ##################################################### 
    node_qty1[t.node] += t.qty1 
    node_qty2[t.node] += t.qty2 
    node_combo[t.node] = node_qty1[t.node] + node_qty2[t.node] 
    if node_combo[t.node] > node_max[t.node]:
        node_max[t.node] = node_combo[t.node]
        node_peak[t.node] = Peak(time_point=t.time_point,
                                 qty1=node_qty1[t.node],
                                 qty2=node_qty2[t.node])

This produces the correct output, but I'm wondering whether it could be made more readable/Pythonic, and/or faster/more scalable.

The above is attractive in that it only loops through the (large) dataset once, but I've essentially copied and pasted three copies of the same algorithm.

To avoid the copy/paste problem above, I also tried this:

def find_peaks(level, dataset):

    def grouping(object, attr_name):
        # 'system' puts every sample into one group; otherwise group by the attribute
        if attr_name == 'system':
            return attr_name
        else:
            return getattr(object, attr_name)

    cuml_qty1 = defaultdict(int)
    cuml_qty2 = defaultdict(int)
    cuml_combo = defaultdict(int)
    level_max = defaultdict(int)
    level_peak = defaultdict(int)

    for t in dataset:
        key = grouping(t, level)  # look the group key up once per sample
        cuml_qty1[key] += t.qty1
        cuml_qty2[key] += t.qty2
        cuml_combo[key] = cuml_qty1[key] + cuml_qty2[key]
        if cuml_combo[key] > level_max[key]:
            level_max[key] = cuml_combo[key]
            level_peak[key] = Peak(time_point=t.time_point,
                                   qty1=cuml_qty1[key],
                                   qty2=cuml_qty2[key])
    return level_peak

system_peak = find_peaks('system', dataset) 
cluster_peak = find_peaks('cluster', dataset) 
node_peak = find_peaks('node', dataset)

For the (non-grouped) system-level calculations, I also came up with this, which is pretty:

dataset.sort(key=operator.attrgetter('time_point')) 

def cuml_sum(seq):
    rseq = []
    t = 0
    for i in seq:
        t += i
        rseq.append(t)
    return rseq

time_get = operator.attrgetter('time_point') 
q1_get = operator.attrgetter('qty1') 
q2_get = operator.attrgetter('qty2') 

timeline = [time_get(t) for t in dataset] 
cuml_qty1 = cuml_sum([q1_get(t) for t in dataset]) 
cuml_qty2 = cuml_sum([q2_get(t) for t in dataset]) 
cuml_combo = [q1 + q2 for q1, q2 in zip(cuml_qty1, cuml_qty2)] 

combo_max = max(cuml_combo)
# Position of the (first) maximum, used to look up the matching entries
max_index = cuml_combo.index(combo_max)
time_max = timeline[max_index]
q1_at_max = cuml_qty1[max_index]
q2_at_max = cuml_qty2[max_index]

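However, despite this version's cool use of list comprehensions and zip(), it loops through the dataset three times just for the system-level calculations, and I can't think of a good way to do the cluster-level and node-level calculations without doing something slow like: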

timeline = defaultdict(int) 
cuml_qty1 = defaultdict(int) 
#...etc. 

for c in cluster_list: 
    timeline[c] = [time_get(t) for t in dataset if t.cluster == c] 
    cuml_qty1[c] = [q1_get(t) for t in dataset if t.cluster == c] 
    #...etc.
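The slow part of the snippet above is that it rescans the whole dataset once per cluster. One possible way around that, sketched here under assumptions, is to bucket the samples by group in a single pass and then run the list-based calculation per bucket. `peaks_by`, the `Sample` record, and the data are hypothetical. Unlike the loop version, this sketch reports the largest running total even when it never rises above zero.

```python
from collections import defaultdict, namedtuple
from itertools import accumulate

# Hypothetical record type standing in for the class described in the post.
Sample = namedtuple("Sample", "time_point cluster node qty1 qty2")

def peaks_by(dataset, key_name):
    """Return {group: (time_point, cum_qty1, cum_qty2)} at each group's peak.

    Assumes dataset is already sorted by time_point.
    """
    buckets = defaultdict(list)
    for t in dataset:  # single pass over the full dataset
        buckets[getattr(t, key_name)].append(t)

    result = {}
    for group, rows in buckets.items():
        cum1 = list(accumulate(r.qty1 for r in rows))
        cum2 = list(accumulate(r.qty2 for r in rows))
        combo = [a + b for a, b in zip(cum1, cum2)]
        i = combo.index(max(combo))  # first position of the maximum
        result[group] = (rows[i].time_point, cum1[i], cum2[i])
    return result

dataset = [
    Sample(1, "c1", "n1", 3, 0),
    Sample(2, "c2", "n3", 1, 1),
    Sample(3, "c1", "n2", -1, 4),
    Sample(4, "c1", "n1", -2, -4),
    Sample(5, "c2", "n3", -1, -1),
]

print(peaks_by(dataset, "cluster"))  # {'c1': (3, 2, 4), 'c2': (2, 1, 1)}
```

Each sample is still touched a constant number of times per grouping level, so the cost no longer grows with the number of clusters or nodes. The trade-off is the extra memory for the per-group lists.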

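Does anyone here at Stack Overflow have suggestions for improvements? The first snippet above runs well for my initial dataset (on the order of a million records), but later datasets will have more records and more clusters/nodes, so scalability is a concern.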

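This is my first non-trivial use of Python, and I want to make sure I'm taking proper advantage of the language (this is replacing a very convoluted set of SQL queries, and earlier versions of the Python code were essentially very inefficient straight translations of those queries). I don't normally do much programming, so I may be missing something elementary.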

Many thanks!

Source

2010-05-30 bbayles

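You could do all the node calculations first, then use the node results to compute the cluster results, and then use the cluster results to compute the system-wide results. This would at least reduce some of the repetition (the same additions) you are currently doing. – unutbu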

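Thanks for the suggestion. However, a cluster's peak can differ from the peaks of its individual nodes. For example, all the nodes in a cluster could exceed their median value at the same time, producing a huge peak for the cluster but not a huge peak for any individual node. – bbayles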
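The objection in the comment above can be checked with a small invented example: two nodes in one cluster whose individual peaks fall at different times than the cluster's peak. The data and the `peak` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples, sorted by time: (time_point, node, qty1, qty2).
# Both nodes belong to the same cluster.
samples = [
    (1, "a", 4, 0),
    (2, "a", -1, 0),
    (3, "b", 0, 3),
    (4, "a", -3, 0),
    (5, "b", 0, 1),
    (6, "b", 0, -4),
]

def peak(rows):
    """Return (time_point, combined_total) at the first maximum of the running total."""
    best_time, best_total, total = None, None, 0
    for time_point, _node, qty1, qty2 in rows:
        total += qty1 + qty2
        if best_total is None or total > best_total:
            best_time, best_total = time_point, total
    return best_time, best_total

by_node = defaultdict(list)
for row in samples:
    by_node[row[1]].append(row)

node_peaks = {node: peak(rows) for node, rows in by_node.items()}
cluster_peak = peak(samples)

print(node_peaks)    # {'a': (1, 4), 'b': (5, 4)}
print(cluster_peak)  # (3, 6)
```

The cluster peaks at time 3 with a total of 6, which is neither node's peak time nor the sum of the node peaks (8). The cluster result therefore cannot be assembled from the node peaks alone.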

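This seems like a classic opportunity to apply a little object orientation. I'd suggest making the derived data a class and abstracting the cumulative sum calculation into something that works on that class.

Something like: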

class DerivedData(object):
    def __init__(self):
        self.qty1 = 0.0
        self.qty2 = 0.0
        self.combo = 0.0
        self.max = 0.0
        self.peak = Peak(time_point=0.0, qty1=0.0, qty2=0.0)

    def accumulate(self, data):
        self.qty1 += data.qty1
        self.qty2 += data.qty2
        self.combo = self.qty1 + self.qty2
        if self.combo > self.max:
            self.max = self.combo
            self.peak = Peak(time_point=data.time_point,
                             qty1=self.qty1,
                             qty2=self.qty2)

sys = DerivedData() 
clusters = defaultdict(DerivedData) 
nodes = defaultdict(DerivedData) 

dataset.sort(key=operator.attrgetter('time_point')) 

for t in dataset: 
    sys.accumulate(t) 
    clusters[t.cluster].accumulate(t) 
    nodes[t.node].accumulate(t)

This solution abstracts the logic for finding the peaks, but still only goes through the dataset once.
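As a self-contained variation on the same idea, the sketch below keeps one accumulator per group and drives all grouping levels from a small table of key functions, so adding another level would be a one-line change. The names `RunningPeak`, `LEVELS`, `find_all_peaks`, and the `Sample`/`Peak` records are assumptions for illustration, not part of the answer above. Integer initial values are used, and `peak` stays `None` until the combined total rises above zero.

```python
from collections import defaultdict, namedtuple
from operator import attrgetter

# Hypothetical record types standing in for the classes used in the thread.
Sample = namedtuple("Sample", "time_point cluster node qty1 qty2")
Peak = namedtuple("Peak", "time_point qty1 qty2")

class RunningPeak:
    """Running totals of qty1/qty2 plus the peak of their sum seen so far."""

    def __init__(self):
        self.qty1 = 0
        self.qty2 = 0
        self.max = 0
        self.peak = None  # stays None until the combined total exceeds 0

    def accumulate(self, sample):
        self.qty1 += sample.qty1
        self.qty2 += sample.qty2
        combo = self.qty1 + self.qty2
        if combo > self.max:
            self.max = combo
            self.peak = Peak(sample.time_point, self.qty1, self.qty2)

# One key function per grouping level; "system" puts every sample in one group.
LEVELS = {
    "system": lambda s: "system",
    "cluster": attrgetter("cluster"),
    "node": attrgetter("node"),
}

def find_all_peaks(dataset):
    totals = {name: defaultdict(RunningPeak) for name in LEVELS}
    # Single time-ordered pass; each sample updates one accumulator per level.
    for s in sorted(dataset, key=attrgetter("time_point")):
        for name, key in LEVELS.items():
            totals[name][key(s)].accumulate(s)
    return {name: {g: rp.peak for g, rp in groups.items()}
            for name, groups in totals.items()}

dataset = [
    Sample(1, "c1", "n1", 3, 0),
    Sample(2, "c2", "n3", 1, 1),
    Sample(3, "c1", "n2", -1, 4),
    Sample(4, "c1", "n1", -2, -4),
    Sample(5, "c2", "n3", -1, -1),
]

peaks = find_all_peaks(dataset)
print(peaks["system"]["system"])  # Peak(time_point=3, qty1=3, qty2=5)
```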

Source

2010-05-30 03:31:05

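Peter, many thanks. This definitely looks nicer. I'll try it out and see how it goes. I should have mentioned that all the time and data values are guaranteed to be integers. In fact, the sum of each quantity at each level is zero. – bbayles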
