/*
Class Project for ECE 375. June 10, 1993

SEPARATION OF TWO CO-CHANNEL TALKERS USING MAXIMUM LIKELIHOOD PITCH DETECTION AND COMB FILTERING WITH THE SOUNDBLASTER AND THE AMD 29000

X-WING CONSULTING
Eugene Kennedy, Scott Langley, and Henry Travis

ABSTRACT

For our final project in Advanced Microprocessors, we are required to use the AMD 29200 demonstration board. One feature of the AMD 29k is the high processing speed offered by the RISC architecture. A useful feature of the demonstration board is the serial port. The Minimon program provides a user interface that supports PC communication through the serial port. These features are useful for a DSP project that uses data files on a PC.

Audio signal processing presents several challenges. We decided to look at the problem of separating multiple signals from a single channel. Specifically, our goal is to separate two voices from a single channel.

The necessary algorithms will be implemented in the C programming language and then compiled into 29k machine code. The interface between the 29k and the PC will be handled by the Minimon program. Minimon can read batch files containing a list of sequential commands, so several commands can be entered at once. The use of batch files will allow for full automation of the process.

THEORY

There are two types of speech from the signal analysis point of view: voiced speech and unvoiced speech. Voiced speech consists of more or less constant frequency tones of some duration, made when vowels are spoken. It is produced when periodic pulses of air generated by the vibrating glottis resonate through the vocal tract, at frequencies dependent on the vocal tract shape. About two-thirds of speech is voiced, and this type of speech is also the most important for intelligibility. Unvoiced speech consists of non-periodic, random-like sounds, caused by air passing through a narrow constriction of the vocal tract, as when consonants are spoken.

Voiced speech, because of its periodic nature, can be identified and extracted [1]. When two speakers are on a common channel using voiced speech, the signal consists of two different frequency trains, each having a fundamental frequency and associated harmonics whose amplitudes decrease with increasing frequency. The goal of our project is to find the fundamental frequencies voiced by each speaker, and selectively filter out the fundamental frequency and associated harmonics of one of them. The result is voiced speech of only one of the speakers.

In separating the voiced speech of two different speakers, care must be taken to track the two different voices over time as they change frequency and undergo intervals of silence. Over an interval of time, it must be determined if each voice is present. If a voice is present, it is classified as belonging to the first or second speaker depending on its fundamental frequency and its volume, assuming the frequencies and volumes of the two speakers are different. Just as human ears have difficulty separating speech from two speakers at nearly the same pitch and volume, so do computer techniques of speech separation.

ALGORITHM DISCUSSION

There are three parts to the algorithm:
1.) Determination of the fundamental frequency of a voiced speech signal,
2.) Classification of a signal (voiced/unvoiced, speaker1/speaker2), and
3.) The routine to filter out a voice.

Identification of the fundamental frequency is done by a maximum likelihood pitch estimation algorithm [2]. Using the autocorrelation values of the speech signal, a function is maximized over the value of the pitch period.

Classification of the signal relies most heavily on heuristic techniques. In the pitch determination algorithm, the energy of the periodic part of speech is calculated and used to determine if voiced speech from one or two speakers is actually present.
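
As a rough illustration of step 1, the search for the pitch period can be sketched as follows. This is a simplified autocorrelation peak search, not the maximum likelihood estimator of [2], which maximizes a function built from the autocorrelation values. The function name estimate_pitch_period and its arguments are hypothetical and do not appear in our program.

```c
#include <stddef.h>

// Hypothetical helper: return the lag in [p_low, p_high] at which the
// autocorrelation of x[0..n-1] is largest. Lags of n or more are skipped.
int estimate_pitch_period(const float *x, int n, int p_low, int p_high)
{
    int best_lag = p_low;
    double best = -1.0e30;

    for (int lag = p_low; lag <= p_high && lag < n; lag++) {
        double r = 0.0;

        // Correlate the window with a copy of itself delayed by `lag`.
        for (int i = 0; i + lag < n; i++)
            r += (double)x[i] * x[i + lag];

        // Divide by the number of products so long lags are not penalized.
        r /= (double)(n - lag);

        if (r > best) {
            best = r;
            best_lag = lag;
        }
    }
    return best_lag;
}
```

For a 1400-sample window holding a pulse train that repeats every 100 samples, a search over lags 12 to 150 returns a period of 100. The lower search limit matters: lag 0 always has the largest autocorrelation, so it must be excluded.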

If one speaker is always louder than the other, the voiced speech corresponding to a set of harmonics can be assigned to speaker one or speaker two depending on its energy. The fundamental frequencies are tracked over time to provide a basis for classifying the two voiced signals by their pitch.

Filtering out a given voiced signal is done with a comb filter [3]. The 'comb' refers to the different harmonics filtered out, which appear like the teeth of a comb when viewed in the frequency domain. It is implemented in the time domain as a finite-impulse-response filter.

IMPLEMENTATION

Besides the DSP algorithms, this project will require a communications link between the 29200 board and the host PC. This is necessary for two reasons. First, the sound sample is recorded and stored in files on the PC. Second, after the data has been processed by the 29k, it needs to be sent back to the PC to be played. The link is created through the serial port on the demonstration board and the PC.

Recording and playing sound files is done with the Soundblaster card and VPlus software. A program called SOX (SOund eXchange - universal sound sample translator) converts sound data files to different data formats. We convert between the VOC format used by the Soundblaster and the RAW format, which stores data as signed eight-bit numbers represented by ASCII characters. The data files are uploaded and downloaded from the 29k via the serial port.

The entire task can be automated with two batch files. The first batch file is a list of Mondfe commands, which is referred to as a log file. The log file will download the DSP algorithms, download a packet of data, process that packet, upload the processed packet, repeat the last three steps for each packet, and then quit Mondfe.
The second is a DOS batch file that will record the signal, convert the signal to a useful format, break up the data file into packets, execute the Mondfe program with the -log option, reassemble the data files, and present the user with the option to play back any of the processed signals.

Faster data transfer could be achieved using a parallel port, but the parallel port for the 29k would have to be built. Furthermore, reading and writing files along with all the control signals would require far too much work for the time available. With the services provided by Minimon, we were able to set up an interface between the PC and 29k with relative ease. The interface is ready and waiting for the completion of the DSP algorithms.

REFERENCES

[1] J. L. Flanagan, "Voices of Men and Machines," J. Acoust. Soc. Am., vol. 51, pp. 1375-1387, Mar. 1972.
[2] J. D. Wise, J. A. Caprio, and T. W. Parks, "Maximum Likelihood Pitch Estimation," IEEE Trans. Acoustics, Speech, and Sig. Proc., vol. ASSP-24, no. 5, pp. 418-423, Oct. 1976.
[3] D. O'Shaughnessy, "Enhancing Speech Degraded by Additive Noise or Interfering Speakers," IEEE Communications Magazine, pp. 46-52, Feb. 1989.
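
As an illustration of the comb filter described under ALGORITHM DISCUSSION, the sketch below averages samples spaced one pitch period apart, which passes a talker with that period and attenuates everything else. Subtracting the result from the input removes that talker instead. This is a minimal sketch under the assumption of equal tap weights and a constant pitch period over the filter span; the names comb_enhance and comb_remove are hypothetical.

```c
#include <stddef.h>

// Hypothetical helper: average up to 2m+1 samples spaced `period` apart,
// centred on x[n] (n must be less than len). Components that repeat every
// `period` samples pass unchanged; other components are attenuated.
float comb_enhance(const float *x, size_t len, size_t n, size_t period, int m)
{
    double sum = 0.0;
    int taps = 0;

    for (int k = -m; k <= m; k++) {
        long idx = (long)n + (long)k * (long)period;

        // Skip taps that fall outside the buffer.
        if (idx >= 0 && (size_t)idx < len) {
            sum += x[idx];
            taps++;
        }
    }
    return taps > 0 ? (float)(sum / taps) : 0.0f;
}

// Hypothetical helper: remove the periodic talker by subtracting the
// enhanced signal from the input sample.
float comb_remove(const float *x, size_t len, size_t n, size_t period, int m)
{
    return x[n] - comb_enhance(x, len, n, period, m);
}
```

With m = 3 the filter has seven taps, matching the 2M+1 coefficients allocated in the program below. More taps give narrower comb teeth but assume the pitch stays constant for longer.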
*/

#include <stdio.h>   /* printf, scanf, fopen */
#include <stdlib.h>  /* malloc, exit */
#include <string.h>  /* strcpy, strcat */
#include <math.h>    /* fabs */
/*#include #include */

#define WINDOW_SIZE 1401
#define SHIFT 700
#define BLOCK_FACT 1
#define P_HIGH 700
#define SAMPLING_FREQUENCY 41100
#define MINIMUM_FUNDAMENTAL 60
#define ALPHA 1.0
#define M 3
#define MAX_HARMONIC 10
#define MEMORY 1.0
/*#define POWER_THRESHOLD 30000.0*/
#define GP_THRESHOLD .001
#define DERIV_THRESHOLD .1
/*#define CUTOFF_PERIOD 50*/

int main()
{
    /*struct time first; struct time second;*/
    /* The output names hold the input name (at most 29 characters) plus a five-character suffix. */
    char sound_file[30], out_file1[36],out_file2[36],temp_char,app1[6]="1.raw",app2[6]="2.raw";
    char *peak,*sound, *data1, *data2, *input1, *input2, *data_start;
    int filter_length, i,j, k,limit,L,m,n, N, P, P_LOW=12,P_high,P_max,P_max2;
    int test_P,index,fund_low1,fund_low2,test_freq,most_likely,out_of_data=0;
    int on_first_block, temp_int, data_needed,pos_deriv,neg_deriv,harmonic=1;
    int fund_freq1,fund_freq2,block_size,outside_width, *past_P_max,past_next;
    int previous_harmonic=1,sixth,fifth,fourth,third,test,*period_shifts;
    int CUTOFF_PERIOD;
    long temp_estimator;
    unsigned long int into=0,outof=0;
    float fundamental_period, *c;
    float *auto_corr,sum,temp_float;
    double *g, g_max, g_max2,estimator,deriv,sum_deriv_before,sum_deriv_after;
    double gp_threshold, deriv_threshold,POWER_THRESHOLD;
    double *confidence1, *confidence2, max_confidence, out;
    FILE *soundfile,*outfile1,*outfile2;

    /* Samples needed on each side of the window for the comb filter taps. */
    outside_width=M*(SAMPLING_FREQUENCY/MINIMUM_FUNDAMENTAL);
    block_size=(2*outside_width+SHIFT)*BLOCK_FACT;

    /* Allocate the working buffers. */
    if((data1=(char *)malloc(3*block_size*sizeof(char)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((data2=(char *)malloc(3*block_size*sizeof(char)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((auto_corr=(float *)malloc(WINDOW_SIZE*sizeof(float)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((peak=(char *)malloc(P_HIGH))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((g=(double *)malloc(P_HIGH*sizeof(double)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((confidence1=(double *)malloc((MAX_HARMONIC+1)*sizeof(double)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((confidence2=(double *)malloc((MAX_HARMONIC+1)*sizeof(double)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((c=(float *)malloc((2*M+1)*sizeof(float)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((past_P_max=(int *)malloc((3)*sizeof(int)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }
    if((period_shifts=(int *)malloc((2*M+1)*sizeof(int)))== NULL) {
        printf("Not enough memory to allocate array\n");
        exit(1);
    }

    filter_length=WINDOW_SIZE-1;

    /*printf("\nEnter GP Threshold: "); scanf("%lf",GP_THRESHOLD); printf("\nEnter DERIV Threshold: "); scanf("%lf",DERIV_THRESHOLD);*/
    /* Read the run parameters; scanf needs the address of each numeric variable. */
    printf("\nEnter sound data file:");
    scanf("%29s",sound_file);
    printf("\nEnter Cutoff Period: ");
    scanf("%d",&CUTOFF_PERIOD);
    printf("\nEnter Power Threshold (e.g., 30000.0): ");
    scanf("%lf",&POWER_THRESHOLD);

    /* Build the two output file names by appending "1.raw" and "2.raw". */
    strcpy(out_file1,sound_file);
    strcat(out_file1,app1);
    strcpy(out_file2,sound_file);
    strcat(out_file2,app2);

    if((soundfile=fopen(sound_file,"rb"))==NULL) {
        printf("Error opening file: %s\n",sound_file);
        exit(1);
    }
    if((outfile1=fopen(out_file1,"w+t"))==NULL) {
        printf("Error opening file: %s\n",out_file1);
        exit(1);
    }
    if((outfile2=fopen(out_file2,"w+t"))==NULL) {
        printf("Error opening file: %s\n",out_file2);
        exit(1);
    }

    temp_float=ALPHA/(2*M+1);
    /*for(k=M;k<2*M+1;k++) c[k]=c[2*M-k]=ALPHA/(2*M+1);*/
    /*for(k=M;k<2*M+1;k++) c[k]=c[2*M-k]=ALPHA/(2*(M+k))+1.0/36;*/

    /* Clear all three entries of the pitch period history. */
    for(i=0;i<3;i++)
        past_P_max[i]=0;
    past_next=0;
    on_first_block=1;
    data_needed=1;
    input1=data_start=data1;
    input2=data2+block_size;
    i=0;
    while(iblock_size)&&data_needed) {
        data_needed=0;
        /* Alternate between the two halves of each data buffer. */
        if(input1==data1) {
            input2=data2;
            input1=data1+block_size;
        } else {
            input2=data2+block_size;
            input1=data1;
        }
        i=0;
        while((iblock_size) {
            if(on_first_block==1) {
                on_first_block=0;
                data_start=data2;
                sound=data2+(sound-data1)-block_size;
            } else {
                on_first_block=1;
                data_start=data1;
                sound=data1+(sound-data2)-block_size;
            }
            data_needed=1;
        }
        /* gettime(&first);*/
        g_max=0.0;
        P_max=999;
        limit=WINDOW_SIZE;
        for(k=0;kg_max) {
                P_max = P;
                g_max=g[P];
                /* printf("g[%d] %lf\n",P,g[P]);*/
            }
        }
        estimator=(P_max*auto_corr[0])/filter_length+g[P_max];

        /* Average the last three pitch periods, then record the new one. */
        sum=0.0;
        for(k=0;k<3;k++)
            sum+=past_P_max[k];
        sum/=3;
        past_P_max[past_next++]=P_max;
        if(past_next==3)
            past_next=0;

        if( P_max>P_LOW*4)
            harmonic=1;
        else if(fabs(sum-P_max)<.7)   /* floating-point difference, so fabs rather than abs */
            harmonic=previous_harmonic;
        else if( P_max>P_LOW*4) {
            gp_threshold=g_max*GP_THRESHOLD;
            deriv_threshold=g_max*DERIV_THRESHOLD;
            for(P=P_LOW;Pgp_threshold)
                if(g[P]>g[P+1]) {
                    /* Sum the slopes on each side of the candidate peak. */
                    sum_deriv_before=0.0;
                    sum_deriv_after=0.0;
                    pos_deriv=1;
                    neg_deriv=1;
                    for(i=0;i<3;i++) {
                        if(pos_deriv==1)
                            if( (deriv=(g[P-i]-g[P-i-1])) > 0.0)
                                sum_deriv_before+=deriv;
                            else
                                pos_deriv=0;
                        if(neg_deriv==1)
                            if( (deriv=(g[P+i]-g[P+i+1])) < 0.0)
                                sum_deriv_after+=deriv;
                            else
                                neg_deriv=0;
                    }
                    deriv_threshold=g[P]*DERIV_THRESHOLD;
                    /* A value of 3 marks a peak that passed the slope test. */
                    if((sum_deriv_before-sum_deriv_after) > deriv_threshold) {
                        peak[P]=3;
                    }
                }
            }

            /* Look for peaks at one sixth and one fifth of P_max. */
            harmonic=1;
            sixth=0;
            if(peak[(test=P_max/6)]==3)
                sixth=test;
            else if(peak[++test]==3)
                sixth=test;
            if(sixth!=0) {
                fifth=0;
                if(peak[(test=P_max/5)]==3)
                    fifth=test;
                else {
                    if(peak[++test]==3)
                        fifth=test;
                }
                if(fifth==0)
                    harmonic=3;
                else if(g[sixth]>g[fifth])
                    harmonic=3;
            }

            /* Look for peaks at one fourth and one third of P_max. */
            if(harmonic==1) {
                fourth=0;
                if(peak[(test=P_max/4)]==3)
                    fourth=test;
                else if(peak[++test]==3)
                    fourth=test;
                if(fourth!=0) {
                    third=0;
                    if(peak[(test=P_max/3)]==3)
                        third=test;
                    else if(peak[++test]==3)
                        third=test;
                }
                if(third==0)
                    harmonic=2;
                else if(g[fourth]>g[third])
                    harmonic=2;
            } else
                harmonic=1;
        }
        previous_harmonic=harmonic;
        fundamental_period=P_max/(harmonic-.00001);
        fprintf(stdout,"\nfund %4.1lf g[%d] = %8.1lf Power_ratio = %5.1lf\n",fundamental_period,P_max,g[P_max],100*estimator/auto_corr[0]);
        sum=0.0;
        for(k=1;k