背景

这个任务是使用pthread实现多线程画一个分形图形Mandelbrot(如下图),它是一个有名的复数集(the Mandelbrot set)的可视化,其结果是一个熟悉而美丽的分形。图中每个像素对应复平面的一个值,每个像素的亮度与确定该值是否包含在 Mandelbrot 集中的计算代价成正比。

任务

单线程跑一次,统计时间,然后多线程跑多次,使用平均时间,可以比较单线程和多线程提高的速度。

pthread入门

pthread_create

声明:

1
2
3
4
int pthread_create(pthread_t *thread, 
const pthread_attr_t *restrict_attr,
void* (*start_rtn)(void*)),
void *restrict arg);

参数说明:

  • 第一个参数pthread_t为指向线程标识符的指针,可以使一个整数或者结构体,不想用int可以使用自带类worker
  • 第二个参数pthread_attr_t*用来设置线程属性,上面也可以用NULL,表示使用默认的属性。
  • 第三个参数是线程运行的起始(启动)函数。
  • 最后一个参数是线程启动函数的参数,NULL表示无参数。pthread_create

下面是一段测试代码,大概就是在主线程新建5个线程,每个线程告诉一下自己的线程号(注意,不是线程标记,而是我们自己传的参数,给他一个号码)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#include <bits/stdc++.h>
#include <pthread.h>
#include <iostream>
using namespace std;

void* pthread_start(void *args)
{
printf("This is thread %d\n",*(static_cast<int*>(args)));
return NULL;
}


int main()
{
pthread_t workers[5];
int thread_ID[5] = {1,2,3,4,5};

for(int i = 0; i < 5; i++)
{
pthread_create(&workers[i],NULL,pthread_start,&thread_ID[i]);
}

printf("This is thread 0. And program finished!\n");

}

结果如下图


事实上只要主线程执行完了,就会直接退出并不会等待其他创建的线程,所以我们看到,有的结果即使5个进程还没执行完,程序依然退出了(但是,由于一个线程执行时间太短,只有一个print语句,所以主线程结束的同时,也打印出来了,为了让结果更明显,我们修改pthread_start函数,让它先等待一段时间再输出)

1
2
3
4
5
6
void* pthread_start(void *args)
{
for(int i = 0; i < 1000000;i++);
printf("This is thread %d\n",*(static_cast<int*>(args)));
return NULL;
}

这样的话,结果如下图:

可以看到 ,5个子线程没有一个可以执行完

另外,需要注意的是,整个程序一共有6个线程,而不是5个,因为main函数里面本身就是一个线程

pthread_join

为了解决上面的问题,可以使用pthread_join,函数,这个函数,会让主线程阻塞直到其他线程(有被pthread_join执行过的线程)执行完再退出

函数原型

1
pthread_join(pthread_t thread, void **retval);

参数:

  • pthread_t, 线程标识
  • **retval 用户自定义的指针,存储线程返回值,返回0,标识成功,否则为错误号。

我们把在主函数中增加pthread_join

1
2
3
4
5
6
7
8
9
10
11
12
13
int main()
{
pthread_t workers[5];
int thread_ID[10] = {1,2,3,4,5,6,7,8,9,10};
for(int i = 0; i < 5; i++)
{
pthread_create(&workers[i],NULL,pthread_start,&thread_ID[i]);
}
for(int i = 0; i < 5;i++)
pthread_join(workers[i],NULL);
printf("This is thread 0. And program finished!\n");
return 0;
}

执行结果如下图:

可以看到,每次都是子线程执行完了,主线程才继续执行,


假如我们稍稍修改,只对线程1-3执行pthread_join

1
2
for(int i = 0; i < 3;i++)
pthread_join(workers[i],NULL);

结果如下图,我们可以看到,线程1~3 必定都执行完,线程4、5就看造化了。

画出 mandelbrot图形

单线程画法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
//
// MandelbrotSerial --
//
// Compute an image visualizing the mandelbrot set. The resulting
// array contains the number of iterations required before the complex
// number corresponding to a pixel could be rejected from the set.
//
// * x0, y0, x1, y1 describe the complex coordinates mapping
// into the image viewport.
// * width, height describe the size of the output image
// * startRow, totalRows describe how much of the image to compute
void mandelbrotSerial(
float x0, float y0, float x1, float y1,
int width, int height,
int startRow, int totalRows,
int maxIterations,
int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;

int endRow = startRow + totalRows;

for (int j = startRow; j < endRow; j++) {
for (int i = 0; i < width; ++i) {
float x = x0 + i * dx;
float y = y0 + j * dy;

int index = (j * width + i);
output[index] = mandel(x, y, maxIterations);
}
}
}

这里调用的mandel函数,是计算mandelbrot的算法,我们并不关心,下面直接给出

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
static inline int mandel(float c_re, float c_im, int count)
{
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i) {

if (z_re * z_re + z_im * z_im > 4.f)
break;

float new_re = z_re*z_re - z_im*z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}

return i;
}

多线程画

每个线程需要的参数,用一个结构体实现

1
2
3
4
5
6
7
8
9
10
typedef struct {
float x0, x1;
float y0, y1;
unsigned int width;
unsigned int height;
int maxIterations;
int* output;
int threadId;
int numThreads;
} WorkerArgs;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
void mandelbrotThread(
int numThreads,
float x0, float y0, float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
const static int MAX_THREADS = 32;

if (numThreads > MAX_THREADS)
{
fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
exit(1);
}

pthread_t workers[MAX_THREADS];
WorkerArgs args[MAX_THREADS];

//给每一个子线程传参,
for (int i=0; i<numThreads; i++) {
// TODO: Set thread arguments here.
args[i].numThreads = numThreads;
args[i].x0 = x0;
args[i].x1 = x1;
args[i].y0 = y0;
args[i].y1 = y1;
args[i].width = width;
args[i].height = height;
args[i].maxIterations = maxIterations;
args[i].threadId = i;
args[i].output = output;
}

// Fire up the worker threads. Note that numThreads-1 pthreads
// are created and the main app thread is used as a worker as
// well.
//创建n-1个线程完成剩下的工作(因为主线程完成1/n)
for (int i=1; i<numThreads; i++){
pthread_create(&workers[i], NULL, workerThreadStart, &args[i]);
//printf("pthread_create!%d/%d\n",i,numThreads);
}
//不能浪费主线程的算力,主线程算一部分
workerThreadStart(&args[0]);
//printf("pthread_start[%d]\n",i);

//等待所有子线程完成,因为并没有子线程worker[0]
// wait for worker threads to complete
for (int i=1; i<numThreads; i++)
pthread_join(workers[i], NULL);

}
workerThreadStart的实现
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
void* workerThreadStart(void* threadArgs) {

WorkerArgs* args = static_cast<WorkerArgs*>(threadArgs);
//这里采用的分配方法是,每个线程画 总行数/线程数部分,也就是水平划分,当然也可以把图像垂直划分
int startRow = args->height / args->numThreads*args->threadId;
int totalRows = args->height / args->numThreads;
int endRow = startRow + totalRows;

// TODO: Implement worker thread here.
float dx = (args->x1 - args->x0) / args->width;
float dy = (args->y1 - args->y0) / args->height;


// printf("starting![%d]\n",args->threadId);

for (int j = startRow; j < endRow; j++) {
for (int i = 0; i < args->width; ++i) {
float x = args->x0 + i * dx;
float y = args->y0 + j * dy;

int index = (j * args->width + i);
// printf("drawing!-%d\n",args->threadId);
args->output[index] = mandel(x, y, args->maxIterations);
}
};

printf("Hello world from thread %d\n", args->threadId);

// return NULL;
}
一种浪费算力的写法,我们稍稍修改mandelbrotThread,让主线程空等,子线程完成所有工作

做法:

注释掉pthrad_start(&args[0]),把create_pthreadpthread_join加上worker[0]且其参数为args[0]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
void mandelbrotThread(
int numThreads,
float x0, float y0, float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
const static int MAX_THREADS = 32;

if (numThreads > MAX_THREADS)
{
fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
exit(1);
}

pthread_t workers[MAX_THREADS];
WorkerArgs args[MAX_THREADS];

for (int i=0; i<numThreads; i++) {
// TODO: Set thread arguments here.
args[i].numThreads = numThreads;
args[i].x0 = x0;
args[i].x1 = x1;
args[i].y0 = y0;
args[i].y1 = y1;
args[i].width = width;
args[i].height = height;
args[i].maxIterations = maxIterations;
args[i].threadId = i;
args[i].output = output;
}

// Fire up the worker threads. Note that numThreads-1 pthreads
// are created and the main app thread is used as a worker as
// well.
//创建n-1个线程完成剩下的工作(因为主线程完成1/n)
for (int i=0; i<numThreads; i++){
pthread_create(&workers[i], NULL, workerThreadStart, &args[i]);
//printf("pthread_create!%d/%d\n",i,numThreads);
}
//不能浪费主线程的算力,主线程算一部分
//workerThreadStart(&args[0]);
//printf("pthread_start[%d]\n",i);

//等待所有子线程完成,因为并没有子线程worker[0]
// wait for worker threads to complete
for (int i=0; i<numThreads; i++)
pthread_join(workers[i], NULL);
}
检查结果是否一致
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
bool verifyResult (int *gold, int *result, int width, int height) {

int i, j;

for (i = 0; i < height; i++) {
for (j = 0; j < width; j++) {
if (gold[i * width + j] != result[i * width + j]) {
printf ("Mismatch : [%d][%d], Expected : %d, Actual : %d\n",
i, j, gold[i * width + j], result[i * width + j]);
return 0;
}
}
}

return 1;
}

最终结果

可以看到,2、4 、8线程,分别是单线程的1.94 、2.15 、2.5倍。

所有代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <algorithm>
#include <getopt.h>
#include <pthread.h>
#include <math.h>
#include <iostream>

#ifndef _SYRAH_CYCLE_TIMER_H_
#define _SYRAH_CYCLE_TIMER_H_

#if defined(__APPLE__)
#if defined(__x86_64__)
#include <sys/sysctl.h>
#else
#include <mach/mach.h>
#include <mach/mach_time.h>
#endif // __x86_64__ or not

#include <stdio.h> // fprintf
#include <stdlib.h> // exit

#elif _WIN32
# include <windows.h>
# include <time.h>
#else
# include <stdio.h>
# include <stdlib.h>
# include <string.h>
# include <sys/time.h>
#endif




// This uses the cycle counter of the processor. Different
// processors in the system will have different values for this. If
// you process moves across processors, then the delta time you
// measure will likely be incorrect. This is mostly for fine
// grained measurements where the process is likely to be on the
// same processor. For more global things you should use the
// Time interface.

// Also note that if you processors' speeds change (i.e. processors
// scaling) or if you are in a heterogenous environment, you will
// likely get spurious results.
class CycleTimer {
public:
typedef unsigned long long SysClock;

//////////
// Return the current CPU time, in terms of clock ticks.
// Time zero is at some arbitrary point in the past.
static SysClock currentTicks() {
#if defined(__APPLE__) && !defined(__x86_64__)
return mach_absolute_time();
#elif defined(_WIN32)
LARGE_INTEGER qwTime;
QueryPerformanceCounter(&qwTime);
return qwTime.QuadPart;
#elif defined(__x86_64__)
unsigned int a, d;
asm volatile("rdtsc" : "=a" (a), "=d" (d));
return static_cast<unsigned long long>(a) |
(static_cast<unsigned long long>(d) << 32);
#elif defined(__ARM_NEON__) && 0 // mrc requires superuser.
unsigned int val;
asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(val));
return val;
#else
timespec spec;
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &spec);
return CycleTimer::SysClock(static_cast<float>(spec.tv_sec) * 1e9 + static_cast<float>(spec.tv_nsec));
#endif
}

//////////
// Return the current CPU time, in terms of seconds.
// This is slower than currentTicks(). Time zero is at
// some arbitrary point in the past.
static double currentSeconds() {
return currentTicks() * secondsPerTick();
}

//////////
// Return the conversion from seconds to ticks.
static double ticksPerSecond() {
return 1.0/secondsPerTick();
}

static const char* tickUnits() {
#if defined(__APPLE__) && !defined(__x86_64__)
return "ns";
#elif defined(__WIN32__) || defined(__x86_64__)
return "cycles";
#else
return "ns"; // clock_gettime
#endif
}

//////////
// Return the conversion from ticks to seconds.
static double secondsPerTick() {
static bool initialized = false;
static double secondsPerTick_val;
if (initialized) return secondsPerTick_val;
#if defined(__APPLE__)
#ifdef __x86_64__
int args[] = {CTL_HW, HW_CPU_FREQ};
unsigned int Hz;
size_t len = sizeof(Hz);
if (sysctl(args, 2, &Hz, &len, NULL, 0) != 0) {
fprintf(stderr, "Failed to initialize secondsPerTick_val!\n");
exit(-1);
}
secondsPerTick_val = 1.0 / (double) Hz;
#else
mach_timebase_info_data_t time_info;
mach_timebase_info(&time_info);

// Scales to nanoseconds without 1e-9f
secondsPerTick_val = (1e-9*static_cast<double>(time_info.numer))/
static_cast<double>(time_info.denom);
#endif // x86_64 or not
#elif defined(_WIN32)
LARGE_INTEGER qwTicksPerSec;
QueryPerformanceFrequency(&qwTicksPerSec);
secondsPerTick_val = 1.0/static_cast<double>(qwTicksPerSec.QuadPart);
#else
FILE *fp = fopen("/proc/cpuinfo","r");
char input[1024];
if (!fp) {
fprintf(stderr, "CycleTimer::resetScale failed: couldn't find /proc/cpuinfo.");
exit(-1);
}
// In case we don't find it, e.g. on the N900
secondsPerTick_val = 1e-9;
while (!feof(fp) && fgets(input, 1024, fp)) {
// NOTE(boulos): Because reading cpuinfo depends on dynamic
// frequency scaling it's better to read the @ sign first
float GHz, MHz;
if (strstr(input, "model name")) {
char* at_sign = strstr(input, "@");
if (at_sign) {
char* after_at = at_sign + 1;
char* GHz_str = strstr(after_at, "GHz");
char* MHz_str = strstr(after_at, "MHz");
if (GHz_str) {
*GHz_str = '\0';
if (1 == sscanf(after_at, "%f", &GHz)) {
//printf("GHz = %f\n", GHz);
secondsPerTick_val = 1e-9f / GHz;
break;
}
} else if (MHz_str) {
*MHz_str = '\0';
if (1 == sscanf(after_at, "%f", &MHz)) {
//printf("MHz = %f\n", MHz);
secondsPerTick_val = 1e-6f / GHz;
break;
}
}
}
} else if (1 == sscanf(input, "cpu MHz : %f", &MHz)) {
//printf("MHz = %f\n", MHz);
secondsPerTick_val = 1e-6f / MHz;
break;
}
}
fclose(fp);
#endif

initialized = true;
return secondsPerTick_val;
}

//////////
// Return the conversion from ticks to milliseconds.
static double msPerTick() {
return secondsPerTick() * 1000.0;
}

private:
CycleTimer();
};

#endif // #ifndef _SYRAH_CYCLE_TIMER_H_

/*

15418 Spring 2012 note: This code was modified from example code
originally provided by Intel. To comply with Intel's open source
licensing agreement, their copyright is retained below.

-----------------------------------------------------------------

Copyright (c) 2010-2011, Intel Corporation
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/


static inline int mandel(float c_re, float c_im, int count)
{
float z_re = c_re, z_im = c_im;
int i;
for (i = 0; i < count; ++i) {

if (z_re * z_re + z_im * z_im > 4.f)
break;

float new_re = z_re*z_re - z_im*z_im;
float new_im = 2.f * z_re * z_im;
z_re = c_re + new_re;
z_im = c_im + new_im;
}

return i;
}

//
// MandelbrotSerial --
//
// Compute an image visualizing the mandelbrot set. The resulting
// array contains the number of iterations required before the complex
// number corresponding to a pixel could be rejected from the set.
//
// * x0, y0, x1, y1 describe the complex coordinates mapping
// into the image viewport.
// * width, height describe the size of the output image
// * startRow, totalRows describe how much of the image to compute
void mandelbrotSerial(
float x0, float y0, float x1, float y1,
int width, int height,
int startRow, int totalRows,
int maxIterations,
int output[])
{
float dx = (x1 - x0) / width;
float dy = (y1 - y0) / height;

int endRow = startRow + totalRows;

for (int j = startRow; j < endRow; j++) {
for (int i = 0; i < width; ++i) {
float x = x0 + i * dx;
float y = y0 + j * dy;

int index = (j * width + i);
output[index] = mandel(x, y, maxIterations);
}
}
}

void
writePPMImage(int* data, int width, int height, const char *filename, int maxIterations)
{
FILE *fp = fopen(filename, "wb");

// write ppm header
fprintf(fp, "P6\n");
fprintf(fp, "%d %d\n", width, height);
fprintf(fp, "255\n");

for (int i = 0; i < width*height; ++i) {

// Clamp iteration count for this pixel, then scale the value
// to 0-1 range. Raise resulting value to a power (<1) to
// increase brightness of low iteration count
// pixels. a.k.a. Make things look cooler.

float mapped = pow( std::min(static_cast<float>(maxIterations),
static_cast<float>(data[i])) / 256.f, .5f);

// convert back into 0-255 range, 8-bit channels
unsigned char result = static_cast<unsigned char>(255.f * mapped);
for (int j = 0; j < 3; ++j)
fputc(result, fp);
}
fclose(fp);
printf("Wrote image file %s\n", filename);
}

void
scaleAndShift(float& x0, float& x1, float& y0, float& y1,
float scale,
float shiftX, float shiftY)
{

x0 *= scale;
x1 *= scale;
y0 *= scale;
y1 *= scale;
x0 += shiftX;
x1 += shiftX;
y0 += shiftY;
y1 += shiftY;

}

void usage(const char* progname) {
printf("Usage: %s [options]\n", progname);
printf("Program Options:\n");
printf(" -t --threads <N> Use N threads\n");
printf(" -v --view <INT> Use specified view settings\n");
printf(" -? --help This message\n");
}

bool verifyResult (int *gold, int *result, int width, int height) {

int i, j;

for (i = 0; i < height; i++) {
for (j = 0; j < width; j++) {
if (gold[i * width + j] != result[i * width + j]) {
printf ("Mismatch : [%d][%d], Expected : %d, Actual : %d\n",
i, j, gold[i * width + j], result[i * width + j]);
return 0;
}
}
}

return 1;
}

typedef struct {
float x0, x1;
float y0, y1;
unsigned int width;
unsigned int height;
int maxIterations;
int* output;
int threadId;
int numThreads;
} WorkerArgs;

//
// workerThreadStart --
//
// Thread entrypoint.
void* workerThreadStart(void* threadArgs) {

WorkerArgs* args = static_cast<WorkerArgs*>(threadArgs);

int startRow = args->height / args->numThreads*args->threadId;
int totalRows = args->height / args->numThreads;
int endRow = startRow + totalRows;

// TODO: Implement worker thread here.
float dx = (args->x1 - args->x0) / args->width;
float dy = (args->y1 - args->y0) / args->height;


// printf("starting![%d]\n",args->threadId);

for (int j = startRow; j < endRow; j++) {
for (int i = 0; i < args->width; ++i) {
float x = args->x0 + i * dx;
float y = args->y0 + j * dy;

int index = (j * args->width + i);
// printf("drawing!-%d\n",args->threadId);
args->output[index] = mandel(x, y, args->maxIterations);
}
};

//printf("Hello world from thread %d\n", args->threadId);

// return NULL;gongzuo
}

//
// MandelbrotThread --
//
// Multi-threaded implementation of mandelbrot set image generation.
// Multi-threading performed via pthreads.
void mandelbrotThread(
int numThreads,
float x0, float y0, float x1, float y1,
int width, int height,
int maxIterations, int output[])
{
const static int MAX_THREADS = 32;

if (numThreads > MAX_THREADS)
{
fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS);
exit(1);
}

pthread_t workers[MAX_THREADS];
WorkerArgs args[MAX_THREADS];

for (int i=0; i<numThreads; i++) {
// TODO: Set thread arguments here.
args[i].numThreads = numThreads;
args[i].x0 = x0;
args[i].x1 = x1;
args[i].y0 = y0;
args[i].y1 = y1;
args[i].width = width;
args[i].height = height;
args[i].maxIterations = maxIterations;
args[i].threadId = i;
args[i].output = output;
}

// Fire up the worker threads. Note that numThreads-1 pthreads
// are created and the main app thread is used as a worker as
// well.
//创建n-1个线程完成剩下的工作(因为主线程完成1/n)
for (int i=1; i<numThreads; i++){
pthread_create(&workers[i], NULL, workerThreadStart, &args[i]);
//printf("pthread_create!%d/%d\n",i,numThreads);
}
//不能浪费主线程的算力,主线程算一部分
workerThreadStart(&args[0]);
//printf("pthread_start[%d]\n",i);

//等待所有子线程完成,因为并没有子线程worker[0]
// wait for worker threads to complete
for (int i=1; i<numThreads; i++)
pthread_join(workers[i], NULL);


}


int main(int argc, char** argv) {

const unsigned int width = 1200;
const unsigned int height = 800;
const int maxIterations = 256;
int numThreads = 2;

float x0 = -2;
float x1 = 1;
float y0 = -1;
float y1 = 1;

// parse commandline options ////////////////////////////////////////////
int opt;
static struct option long_options[] = {
{"threads", 1, 0, 't'},
{"view", 1, 0, 'v'},
{"help", 0, 0, '?'},
{0 ,0, 0, 0}
};

while ((opt = getopt_long(argc, argv, "t:v:?", long_options, NULL)) != EOF) {

switch (opt) {
case 't':
{
numThreads = atoi(optarg);
break;
}
case 'v':
{
int viewIndex = atoi(optarg);
// change view settings
if (viewIndex == 2) {
float scaleValue = .015f;
float shiftX = -.986f;
float shiftY = .30f;
scaleAndShift(x0, x1, y0, y1, scaleValue, shiftX, shiftY);
} else if (viewIndex > 1) {
fprintf(stderr, "Invalid view index\n");
return 1;
}
break;
}
case '?':
default:
usage(argv[0]);
return 1;
}
}
// end parsing of commandline options


int* output_serial = new int[width*height];
int* output_thread = new int[width*height];

//
// Run the serial implementation. Run the code three times and
// take the minimum to get a good estimate.
//
memset(output_serial, 0, width * height * sizeof(int));
double minSerial = 1e30;
for (int i = 0; i < 5; ++i) {
double startTime = CycleTimer::currentSeconds();
mandelbrotSerial(x0, y0, x1, y1, width, height, 0, height, maxIterations, output_serial);
double endTime = CycleTimer::currentSeconds();
minSerial = std::min(minSerial, endTime - startTime);
}

printf("[mandelbrot serial]:\t\t[%.3f] ms\n", minSerial * 1000);
writePPMImage(output_serial, width, height, "mandelbrot-serial.ppm", maxIterations);

//
// Run the threaded version
//
memset(output_thread, 0, width * height * sizeof(int));
double minThread = 1e30;
for (int i = 0; i < 5; ++i) {
double startTime = CycleTimer::currentSeconds();
mandelbrotThread(numThreads, x0, y0, x1, y1, width, height, maxIterations, output_thread);
double endTime = CycleTimer::currentSeconds();
minThread = std::min(minThread, endTime - startTime);
}

printf("[mandelbrot thread]:\t\t[%.3f] ms\n", minThread * 1000);
writePPMImage(output_thread, width, height, "mandelbrot-thread.ppm", maxIterations);
// std::cout << output_thread << std::endl;
if (! verifyResult (output_serial, output_thread, width, height)) {
printf ("Error : Output from threads does not match serial output\n");

delete[] output_serial;
delete[] output_thread;

return 1;
}

// compute speedup
printf("\t\t\t\t(%.2fx speedup from %d threads)\n", minSerial/minThread, numThreads);

delete[] output_serial;
delete[] output_thread;

return 0;
}