Emotionally-Relevant Features for Classification and Regression of Music Lyrics

This research addresses the role of lyrics in the music emotion recognition process. Our approach is based on several state-of-the-art features complemented by novel stylistic, structural and semantic features. To evaluate our approach, we created a ground truth dataset containing 180 song lyrics, annotated according to Russell's emotion model. We conduct four types of experiments: regression and classification by quadrant, arousal and valence categories. Compared to the state-of-the-art features (n-grams, used as baseline), adding other features, including the novel ones, improved the F-measure from 69.9, 82.7 and 85.6 percent to 80.1, 88.3 and 90 percent, respectively, for the three classification experiments. To study the relation between features and emotions (quadrants), we performed experiments to identify the features that best describe and discriminate each quadrant. To further validate these experiments, we built a validation set comprising 771 lyrics extracted from the AllMusic platform, achieving a 73.6 percent F-measure in the classification by quadrants. We also conducted experiments to identify interpretable rules that show the relation between features and emotions, and the relations among features. Regarding regression, results show that, compared to similar studies for audio, we achieve a similar performance for arousal and a much better performance for valence.


INTRODUCTION
Music emotion recognition (MER) is gaining significant attention in the Music Information Retrieval (MIR) scientific community. In fact, searching for music by emotion is one of the main criteria used by listeners [1].
Real-world music databases from sites like AllMusic (http://www.allmusic.com/) or Last.fm grow larger and larger on a daily basis, which requires a tremendous amount of manual work to keep them updated. Unfortunately, manually annotating music with emotion tags is normally a subjective, expensive and time-consuming task. This can be overcome with the use of automatic recognition systems [2].
Most of the early-stage automatic MER systems were based on audio content analysis (e.g., [3]). Later on, researchers started combining audio and lyrics, leading to bi-modal MER systems with improved accuracy (e.g., [2], [4], [5]). This does not come as a surprise, since the importance of each dimension (audio or lyrics) evidently depends on the music style. For example, in dance music the audio is the most relevant dimension, while in poetic music (e.g., Jacques Brel) the lyrics are key.
Several psychological studies confirm the importance of lyrics in conveying semantic information. Namely, according to Juslin and Laukka [6], 29% of people mention lyrics as an important factor in how music expresses emotions. Also, Besson et al. [7] have shown that part of the semantic information of songs resides exclusively in the lyrics.
Despite the recognized importance of lyrics, current research in Lyrics-based MER (LMER) is facing the so-called glass-ceiling effect [8] (which also happened in audio). In our view, this ceiling can be broken with recourse to dedicated emotion-related lyrical features. In fact, so far most of the employed features are directly imported from general text mining tasks, e.g., bag-of-words (BOW) and part-of-speech (POS) tags, and, thus, are not specialized to the emotion recognition context. Namely, these state-of-the-art features do not account for specific text emotion attributes, e.g., how formal or informal the text language is, how the lyric is structured, and so forth.
To fill this gap we propose novel features, namely:
- Slang presence, which counts the number of slang words from a dictionary of 17,700 words;
- Structural analysis features, e.g., the number of repetitions of the title and chorus, and the relative position of verses and chorus in the lyric;
- Semantic features, e.g., gazetteers personalized to the employed emotion categories.
Additionally, we create a new, manually annotated, (partially) public dataset to validate the proposed features. This might be relevant for future system benchmarking, since none of the current datasets in the literature is public (e.g., [5]). Moreover, to the best of our knowledge, there are no emotion lyrics datasets in the English language that are annotated with continuous arousal and valence values.
The paper is organized as follows. In Section 2, the related work is described and discussed. Section 3 presents the methods employed in this work, particularly the proposed features and ground truth. The results attained by our system are presented and discussed in Section 4. Finally, Section 5 summarizes the main conclusions of this work and possible directions for future research.

RELATED WORK
The relations between emotions and music have been a subject of active research in music psychology for many years. Different emotion paradigms (e.g., categorical or dimensional) and taxonomies (e.g., Hevner, Russell) have been defined [9], [10] and exploited in different computational MER systems.
Identification of musical emotions from lyrics is still in an embryonic stage. Most of the previous studies related to this subject used general text instead of lyrics, and polarity detection instead of emotion detection. More recently, LMER has gained significant attention from the MIR scientific community.
Feature extraction is one of the key stages of the LMER process. Previous works employing lyrics as a dimension for MER typically resort to content-based features (CBF) like Bag-Of-Words (BOW) [5], [11], [12], with possible transformations like stemming and stopword removal. Other regularly used CBFs are Part-Of-Speech (POS) tags followed by BOW [12]. Additionally, linguistic and text stylistic features [2] are also employed.
Despite the relevance of such features and their applicability in general contexts, we believe they do not capture several aspects that are specific to emotion recognition in lyrics. Therefore, we propose new features, as described in Section 3.
As for systems based on manual annotations, it is difficult to compare them, since they all use different emotion taxonomies and datasets. Moreover, the employed datasets are not public. As for automatic approaches, platforms like AllMusic or Last.fm are often employed. However, the quality of these annotations might be questionable because, for example in Last.fm, the tags are assigned by online users, which in some cases may cause ambiguity. In AllMusic, despite the fact that the annotations are made by experts [14], it is not clear whether they are annotating songs using only the audio, only the lyrics, or a combination of both.
Due to the limitations of the annotations in approaches like AllMusic and Last.fm, and the fact that the datasets proposed by other researchers are not public, we decided to construct a manually annotated dataset. Our goal is to study the importance of each feature in a context of emotion recognition from lyrics. So, to measure the impact of the lyrics on the emotions, the annotators were explicitly told to ignore the audio during the annotations. In the same way, some researchers in the audio field ask annotators to ignore the lyrics when they want to evaluate models focused on audio [15]. This holds regardless of the fact that, when listening to a song, both dimensions may be used; in the future, we intend to fuse both dimensions and perform a bimodal analysis. Additionally, to facilitate future benchmarking, the constructed dataset will be made partially public (http://mir.dei.uc.pt/resources/MER_lyrics_dataset.zip), i.e., we provide the names of the artists and the song titles, as well as valence and arousal values, but not the song lyrics, due to copyright issues; instead, we provide the URLs from which each lyric was retrieved.
Most current LMER approaches are black-box models rather than interpretable models. In [14], the authors use a human-comprehensible model to find relations between features from General Inquirer (GI) and emotions. We use interpretable rules to match emotions and features not only from GI but from other types (e.g., stylistic, structural and semantic) and platforms such as LIWC, ConceptNet and Synesketch.

Dataset Construction
As mentioned above, current MER systems follow either the categorical or the dimensional emotion paradigm. It is often argued that dimensional paradigms lead to lower ambiguity since, instead of a discrete set of emotion adjectives, emotions are regarded as a continuum [11]. One of the most well-known dimensional models is Russell's circumplex model [16], where emotions are positioned in a two-dimensional plane comprising two axes, designated valence and arousal, as illustrated in Figure 1. According to Russell [17], valence and arousal are the "core processes" of affect, forming the raw material or primitive of emotional experience.

Data Collection
To construct our ground truth, we started by collecting 200 song lyrics. The criteria for selecting the songs were the following:
- Several musical genres and eras (see Table 1);
- Songs distributed uniformly across the 4 quadrants of the Russell emotion model;
- Each song belonging predominantly to one of the 4 quadrants in the Russell plane.
To this end, before performing the annotation study described in the next section, the songs were pre-annotated by our team and were nearly balanced across quadrants.
Next, we used the Google API to search for the song lyrics. In this process, three sites were used for lyrical information: lyrics.com, ChartLyrics and MaxiLyrics.
The obtained lyrics were then preprocessed to improve their quality. Namely, we performed the following tasks:
- Correction of orthographic errors;
- Elimination of songs with non-English lyrics;
- Elimination of songs with lyrics shorter than 100 characters;
- Elimination of text not related to the lyric (e.g., names of the artists, composers, instruments);
- Elimination of common patterns in lyrics such as [Chorus x2], [Verse 1 x2], etc.;
- Complementation of the lyric according to the corresponding audio (e.g., chorus repetitions in the audio are added to the lyrics).
To further validate our system, we have also built a larger validation set. This dataset was built in the following way:
1. First, we mapped the mood tags from AllMusic into the words from the ANEW dictionary (ANEW has 1034 words with values for arousal (A) and valence (V)). Depending on the values of A and V, we can associate each word with a single Russell quadrant. From that mapping, we obtained 33 words for quadrant 1 (e.g., fun, happy, triumphant), 29 words for quadrant 2 (e.g., tense, nervous, hostile), 12 words for quadrant 3 (e.g., lonely, sad, dark) and 18 words for quadrant 4 (e.g., relaxed, gentle, quiet).
2. Then, we considered that a song belongs to a specific quadrant if all of the corresponding AllMusic tags belong to that quadrant. Based on this requirement, we initially extracted 400 lyrics from each quadrant (the ones with a higher number of emotion tags), using the AllMusic web service.
3. Next, we developed tools to automatically search for the lyrics of the previous songs, using three sites: Lyrics.com, ChartLyrics and MaxiLyrics.
4. Finally, this initial set was validated by three people.
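The A/V-to-quadrant mapping in step 1 can be sketched as follows. This is a minimal illustration, assuming that 5 is the neutral point of both dimensions on ANEW's 1-9 rating scale; the paper does not state the exact threshold it uses:

```python
def russell_quadrant(valence, arousal, neutral=5.0):
    # Map an ANEW (valence, arousal) pair to one of Russell's quadrants.
    # ANEW ratings range from 1 to 9; 5 is assumed to be the neutral point.
    if valence >= neutral and arousal >= neutral:
        return 1  # e.g., fun, happy, triumphant
    if valence < neutral and arousal >= neutral:
        return 2  # e.g., tense, nervous, hostile
    if valence < neutral and arousal < neutral:
        return 3  # e.g., lonely, sad, dark
    return 4      # e.g., relaxed, gentle, quiet
```

With such a mapping, each ANEW word (and hence each AllMusic mood tag matched to it) receives a single quadrant label.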
Here, we followed the same procedure employed by Laurier [5]: a song is validated into a specific quadrant if at least one of the annotators agreed with AllMusic's annotation (Last.fm in his case). This resulted in a dataset with 771 lyrics (211 for Q1, 205 for Q2, 205 for Q3, 150 for Q4). Even though the number of lyrics in Q4 is smaller, the dataset is still nearly balanced.

Annotations and Validation
The annotation of the dataset was performed by 39 people with different backgrounds. To better understand their background, we delivered a questionnaire, which was answered by 62% of the volunteers. Of the annotators who answered the questionnaire, 24% have musical training; regarding their education level, 35% have a BSc degree, 43% an MSc, 18% a PhD and 4% no higher-education degree. Regarding gender balance, 60% were male and 40% female. During the process, we recommended the following annotation methodology:
1. Read the lyric;
2. Identify the basic predominant emotion expressed by the lyric (if the annotator felt there was more than one emotion, he/she should pick the predominant one);
3. Assign values (between -4 and 4) to valence and arousal; the granularity of the annotation is the unit, which means that annotators could use 9 possible values, from -4 to 4;
4. Fine-tune the values assigned in step 3 by ranking the samples.
To further improve the quality of the annotations, the annotators were also advised not to search for information about the lyric or the song on the Internet or elsewhere, and to avoid tiredness by taking a break and continuing later.
We obtained an average of 8 annotations per lyric. The arousal and valence of each song were then obtained by averaging the annotations of all subjects. We used the mean trimmed by 10% to reduce the effect of outliers.
To improve the consistency of the ground truth, the standard deviation (SD) of the annotations made by different subjects for the same song was evaluated. Songs with an SD above 1.2 were excluded from the original set. As a result, 20 songs were discarded, leading to a final dataset containing 180 lyrics. This leads to a 95% confidence interval [18] of about ±0.4, which we believe is acceptable in our -4.0 to 4.0 annotation range. Finally, the consistency of the ground truth was evaluated using Krippendorff's alpha [19], a measure of inter-coder agreement, which reached 0.87 for valence and 0.82 for arousal (on the -4 to 4 range). This is considered a strong agreement among the annotators.
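The per-song aggregation and filtering just described can be sketched as follows. The trim proportion (10%) and SD cutoff (1.2) are the values reported above; the helper itself is an illustrative reconstruction, not the authors' code:

```python
import statistics

def aggregate(annotations, trim=0.10, sd_cutoff=1.2):
    # Aggregate per-song annotations on the paper's -4..4 scale:
    # reject songs whose annotators disagree too much (SD > 1.2),
    # otherwise return the 10%-trimmed mean to damp outliers.
    if statistics.stdev(annotations) > sd_cutoff:
        return None  # song excluded from the ground truth
    vals = sorted(annotations)
    k = int(len(vals) * trim)  # number of values trimmed from each tail
    kept = vals[k:len(vals) - k] if k else vals
    return sum(kept) / len(kept)
```

A song with annotations [1, 1, 1, 1, 1, 1, 1, 1] is kept with value 1.0, while a song annotated [-4, 4, -4, 4] is discarded.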
One important issue to consider is how familiar the lyrics are to the annotators. 13% of the respondents reported that they were familiar with 12% of the lyrics (on average). Nevertheless, the annotation process appears sufficiently robust in this respect, since there was an average of 8 annotations per lyric and the annotation agreement (Krippendorff's alpha) was very high, as discussed above. This suggests that the results were not skewed.
Although the size of the dataset is not large, we think it is acceptable for our experiments and is similar to other manually annotated datasets (e.g., [11] has 195 songs). Finally, the distribution of lyrics across quadrants and genres is presented in Table 1. We can see that, except for quadrant 2, where almost half of the songs belong to the heavy metal genre, the quadrants span several genres.

Emotion Categories
Finally, each song is labeled as belonging to one of the four possible quadrants, as well as the respective arousal hemisphere (north or south) and valence meridian (east or west). In this work, we evaluate the classification capabilities of our system in the three described problems. According to quadrants, the songs are distributed in the following way: quadrant 1: 44 lyrics; quadrant 2: 41 lyrics; quadrant 3: 51 lyrics; quadrant 4: 44 lyrics (see Table 1).
As for arousal hemispheres, we ended up with 85 lyrics with positive arousal and 95 with negative arousal.
Regarding the valence meridians, we have 88 lyrics with positive valence and 92 with negative valence.

Content-Based Features (CBF)
The most commonly used features in text analysis, as well as in lyric analysis, are content-based features (CBF), namely the bag-of-words (BOW) [20].
In this model, the text is represented as a set of bags which normally correspond to unigrams, bigrams or trigrams. BOW is normally associated with a set of transformations, such as stemming and stopword removal, applied immediately after tokenization of the original text. Stemming reduces each word to its stem, under the assumption that words sharing the same stem do not differ from the semantic point of view; through stemming, the words "argue", "argued", "argues", "arguing" and "argus" would all be reduced to the same stem, "argu". Stopwords (e.g., the, is, in, at), also called function words, are very common words in a given language which normally carry little information. They include mainly determiners, pronouns and other grammatical particles which, given their frequency across large document collections, are not discriminative. BOW may also be applied without any of these transformations, as in, for example, [12].
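A minimal BOW pipeline with stopword removal and stemming might look as follows. The tiny stopword list and the crude suffix stripper are illustrative stand-ins for a full stopword lexicon and the Porter stemmer:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real systems use a full lexicon.
STOPWORDS = {"the", "is", "in", "at", "a", "an", "of", "and", "to"}

def naive_stem(word):
    # Crude suffix stripping for illustration only; a real system would
    # use the Porter stemmer, which maps argue/argued/argues/arguing
    # (and argus) to the stem "argu".
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    # Tokenize, drop stopwords, stem, and count term occurrences.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOPWORDS)

bow = bag_of_words("The student argues and the students argued")
```

Here both "argues" and "argued" collapse into the single term "argu", while the stopwords "the" and "and" are discarded.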
Part-of-speech (POS) tags are another type of state-of-the-art feature. They consist in attributing the corresponding grammatical class to each word. For example, the grammatical tagging of the sentence "The student reads the book" would be "The/DT student/NN reads/VBZ the/DT book/NN", where DT, NN and VBZ mean, respectively, determiner, noun, and verb in the 3rd person singular present. POS tagging is typically followed by a BOW analysis. This technique was used in studies such as [21].
In our research, we use all combinations of unigrams, bigrams and trigrams with the aforementioned transformations. We also use n-grams of POS tags, from bigrams to 5-grams.
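N-gram extraction over either word tokens or POS tags reduces to a sliding window; a minimal sketch:

```python
def ngrams(tokens, n):
    # Return the list of n-grams (as tuples) over a token sequence;
    # works equally for word tokens and POS-tag sequences.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# POS-tag trigrams for "The student reads the book" (Penn Treebank tags):
tags = ["DT", "NN", "VBZ", "DT", "NN"]
grams = ngrams(tags, 3)
```

For the five tags above, `ngrams(tags, 3)` yields the three overlapping trigrams (DT, NN, VBZ), (NN, VBZ, DT) and (VBZ, DT, NN).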

Stylistic-Based Features (StyBF)
These features are related to stylistic aspects of the language. One of the issues related to the written style is the choice of the type of the words to convey a certain idea (or emotion, in our study). Concerning music, those issues can be related to the style of the composer, the musical genre or the emotions that we intend to convey.
We use 36 features representing the number of occurrences of 36 different grammatical classes in the lyrics. We use the POS tags of the Penn Treebank Project [22], such as, for instance, JJ (adjective), NNS (noun, plural), RB (adverb), UH (interjection) and VB (verb). Some of these features are also used by authors like [12].
We use two features related to the use of capital letters: All Capital Letters (ACL), which represents the number of words with all letters in uppercase and First Capital Letter (FCL), which represents the number of words initialized by an uppercase letter, excluding the first word of each line.
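The two capitalization features can be computed in a few lines. Since the text does not specify whether an all-caps word can also count toward FCL, this sketch counts each word at most once:

```python
def capitalization_features(lyric):
    # ACL: words written fully in uppercase.
    # FCL: words starting with an uppercase letter, excluding the
    # first word of each line (as defined in the paper).
    acl = fcl = 0
    for line in lyric.splitlines():
        for i, word in enumerate(line.split()):
            if word.isupper() and any(c.isalpha() for c in word):
                acl += 1
            elif i > 0 and word[0].isupper():
                fcl += 1
    return acl, fcl

acl, fcl = capitalization_features("HEY you There\nMy OH my")
```

For the two-line example, ACL is 2 ("HEY", "OH") and FCL is 1 ("There"); "My" is skipped because it opens a line.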
Finally, we propose a new feature: the number of occurrences of slang words (abbreviated as #Slang). These slang words (17,700 words) are taken from the Online Slang Dictionary (http://onlineslangdictionary.com/; American, English and Urban slang). We propose this feature because, in specific genres like hip-hop, ideas are normally expressed with a lot of slang, so we believe it may be important to describe specific emotions associated with specific genres.

Song-Structure-Based Features (StruBF)
To the best of our knowledge, no previous work on LMER employs features related to the structure of the lyric. However, we believe this type of feature is relevant for LMER. Hence, we propose novel features of this kind, namely:
- #CH, the number of times the chorus is repeated in the lyric;
- #Title, the number of times the title appears in the lyric.
Common sense says, for example, that more danceable songs normally have more repetitions of the chorus. We believe composers take the different structures that a lyric may have into account to express emotions, which is why we propose these features.
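A rough sketch of how #CH and #Title could be computed. The paper does not publish its exact matching rules, so case-insensitive substring matching is assumed here, with the chorus identified by its first line:

```python
import re

def structural_features(lyric, title, chorus_first_line):
    # #Title: occurrences of the title inside the lyric (case-insensitive).
    # #CH: chorus repetitions, approximated by counting occurrences of
    # the chorus's first line. Both matching rules are assumptions.
    text = lyric.lower()
    n_title = len(re.findall(re.escape(title.lower()), text))
    n_chorus = len(re.findall(re.escape(chorus_first_line.lower()), text))
    return n_chorus, n_title

n_ch, n_title = structural_features(
    "Hello world\nsing it loud\nhello world", "Hello World", "sing it loud"
)
```

In the toy lyric, the title appears twice and the chorus line once, giving #CH = 1 and #Title = 2.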

Semantic-Based Features (SemBF)
These features are related to semantic aspects of the lyrics. In this case, we used features based on existing frameworks: Synesketch (8 features), ConceptNet (8 features), LIWC (82 features) and GI (182 features).
In addition to the previous frameworks, we use features based on known dictionaries: DAL [23] and ANEW [24]. From DAL (Dictionary of Affect in Language) we extract 3 features which are the average in lyrics of the dimensions pleasantness, activation and imagery. Each word in DAL is annotated with these 3 dimensions. As for ANEW (Affective Norms for English Words) we extract 3 features which are the average in lyrics of the dimensions valence, arousal and dominance. Each word in ANEW is annotated with these 3 dimensions.
Additionally, we propose 14 new features based on gazetteers which represent the 4 quadrants of the Russell emotion model. We constructed the gazetteers according to the following procedure:
1. We define as seed words the 18 emotion terms of Russell's plane (see Figure 1).
2. From the 18 terms, we consider for the gazetteers only the ones present in the DAL or ANEW dictionaries. In DAL, we assume that pleasantness corresponds to valence and activation to arousal, based on [25], and we employ the scale defined in DAL: arousal and valence (AV) values from 1 to 3. If a word is not in the DAL dictionary but is present in ANEW, we still consider it and convert its arousal and valence values from the ANEW scale to the DAL scale.
3. We then extend the seed words through WordNet-Affect [26], collecting the emotional synonyms of the seed words (e.g., some synonyms of joy are exuberance, happiness, bonheur and gladness). The AV values from DAL (or ANEW) are assigned to these new words as described in step 2.
4. Finally, we search for synonyms of the gazetteer's current words in WordNet and repeat the process described in step 2.
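The ANEW-to-DAL conversion in step 2 is stated but not given explicitly. Assuming a linear rescaling from ANEW's 1-9 rating range onto DAL's 1-3 range:

```python
def anew_to_dal(value, anew_lo=1.0, anew_hi=9.0, dal_lo=1.0, dal_hi=3.0):
    # Linearly rescale an ANEW rating (1-9) onto the DAL range (1-3).
    # The linear form of the mapping is an assumption; the paper only
    # states that a conversion between the two scales is performed.
    return dal_lo + (value - anew_lo) * (dal_hi - dal_lo) / (anew_hi - anew_lo)
```

Under this mapping, ANEW's extremes (1 and 9) land on DAL's extremes (1 and 3), and ANEW's midpoint 5 maps to DAL's midpoint 2.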
Before the insertion of any word in the gazetteer (from step 1 on), each proposed word is validated by two persons according to its emotional value; there must be unanimity between the two annotators. The two persons involved in the validation were not linguistics scholars but were sufficiently knowledgeable for the task. Overall, the resulting gazetteers comprised 132, 214, 78 and 93 words, respectively, for quadrants 1, 2, 3 and 4.
The extracted features are:
- VinGAZQ1 (average valence of the lyric's words that are also present in the gazetteer of quadrant 1);
- AinGAZQ1 (average arousal of the lyric's words that are also present in the gazetteer of quadrant 1);
- VinGAZQ2 (average valence of the lyric's words that are also present in the gazetteer of quadrant 2);
- AinGAZQ2 (average arousal of the lyric's words that are also present in the gazetteer of quadrant 2);
- VinGAZQ3 (average valence of the lyric's words that are also present in the gazetteer of quadrant 3);
- AinGAZQ3 (average arousal of the lyric's words that are also present in the gazetteer of quadrant 3);
- VinGAZQ4 (average valence of the lyric's words that are also present in the gazetteer of quadrant 4);
- AinGAZQ4 (average arousal of the lyric's words that are also present in the gazetteer of quadrant 4);
- #GAZQ1 (number of words of gazetteer 1 that are present in the lyric);
- #GAZQ2 (number of words of gazetteer 2 that are present in the lyric);
- #GAZQ3 (number of words of gazetteer 3 that are present in the lyric);
- #GAZQ4 (number of words of gazetteer 4 that are present in the lyric);
- VinGAZQ1Q2Q3Q4 (average valence of the lyric's words that are present in any of the gazetteers of quadrants 1, 2, 3 and 4);
- AinGAZQ1Q2Q3Q4 (average arousal of the lyric's words that are present in any of the gazetteers of quadrants 1, 2, 3 and 4).
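For a single quadrant, the gazetteer features reduce to a membership count plus averages of per-word AV ratings. A sketch, with toy dictionaries standing in for the real gazetteer and the DAL/ANEW ratings:

```python
def gazetteer_features(lyric_words, gazetteer, word_valence, word_arousal):
    # For one quadrant: #GAZ (count of lyric words found in the gazetteer)
    # plus VinGAZ/AinGAZ (mean valence/arousal of those words, using
    # DAL-scale per-word ratings). Input dictionaries are toy stand-ins.
    hits = [w for w in lyric_words if w in gazetteer]
    if not hits:
        return 0, 0.0, 0.0
    v = sum(word_valence[w] for w in hits) / len(hits)
    a = sum(word_arousal[w] for w in hits) / len(hits)
    return len(hits), v, a

n, v, a = gazetteer_features(
    ["happy", "sad", "happy"], {"happy"}, {"happy": 2.8}, {"happy": 2.5}
)
```

With "happy" occurring twice and being the only gazetteer hit, #GAZ is 2 and the averages are simply that word's ratings.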

Feature grouping
The proposed features are organized into four different groups. CBF. We define 10 feature sets of this type: 6 are BOW (1-grams up to 3-grams) after tokenization, with and without stemming (st) and stopword removal (sw); 4 are BOW (2-grams up to 5-grams) after the application of a POS tagger, without st and sw. These BOW features are used as the baseline, since they are a reference in most studies [2], [27].
StyBF. We define 2 feature sets: the first corresponds to the number of occurrences of POS tags in the lyrics after the application of a POS tagger (a total of 36 different grammatical classes or tags); the second represents the number of slang words (#Slang) and the features related to words in capital letters (ACL and FCL).
StruBF. We define one feature set with all the structural features.
SemBF. We define 4 feature sets: the first with the features from Synesketch and ConceptNet; the second with the features from LIWC; the third with the features from GI; and the last with the features from gazetteers, DAL and ANEW.
We use the term frequency and the term frequency-inverse document frequency (tf-idf) as representation values in the datasets.
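The tf-idf representation can be sketched as follows for a toy corpus of per-document term counts. The common idf = log(N/df) variant is assumed here; the paper does not state which weighting scheme it uses:

```python
import math

def tfidf(term_counts_per_doc):
    # Compute tf-idf weights for a small corpus given per-document term
    # counts: tf is the raw count and idf = log(N / document frequency).
    n_docs = len(term_counts_per_doc)
    df = {}
    for counts in term_counts_per_doc:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        for counts in term_counts_per_doc
    ]

weights = tfidf([{"love": 2, "the": 1}, {"the": 1}])
```

A term present in every document (here "the") receives weight 0, while a document-specific term like "love" keeps a positive weight.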

Classification and Regression
For classification and regression, we use Support Vector Machines (SVM) [28], since, based on previous evaluations, this technique performed generally better than other methods. A polynomial kernel was employed and a grid parameter search was performed to tune the parameters of the algorithm. Feature selection and ranking with the ReliefF algorithm [29] were also performed in each feature set, in order to reduce the number of features. In addition, for the best features in each model, we analyzed the resulting feature probability density functions (pdf) to validate the feature selection that resulted from ReliefF, as described below.
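The paper uses the ReliefF algorithm [29] for feature ranking. Its core idea, rewarding features that separate nearest misses and penalizing those that separate nearest hits, can be illustrated with the simpler two-class Relief (no k-nearest averaging or multi-class extension, which ReliefF adds):

```python
import math
import random

def relief(X, y, n_iter=100, seed=0):
    # Two-class Relief sketch: for each sampled instance, increase the
    # weight of features that differ on the nearest miss (other class)
    # and decrease it for features that differ on the nearest hit.
    rng = random.Random(seed)
    n_feat = len(X[0])
    w = [0.0] * n_feat

    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))
        m = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(n_feat):
            w[f] += abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])
    return w
```

On data where only the first feature separates the classes, the first weight grows while an uninformative constant feature stays at zero, which is exactly the ranking behaviour exploited for feature selection.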
For both classification and regression, results were validated with repeated stratified 10-fold cross validation [30] (with 10 repetitions) and the average performance is reported. Since we performed a very large number of experiments and each task uses different settings, it is not feasible to present all the employed parameters. As an example, we present only the parameters for the validation dataset (771 lyrics) in Section 4.2.1.

Regression Analysis
The regressors for arousal and valence were applied using the feature sets for the different types of features (e.g., SemBF). Then, after feature selection, ranking and reduction with the ReliefF algorithm, we created regressors for the combinations of the best feature sets.
To evaluate the performance of the regressors, the coefficient of determination, R² [31], was applied. This statistic indicates the goodness of fit of a model, i.e., how well the data fit a statistical model. A value of 1 means the model perfectly fits the data, while a negative value indicates that the model does not fit the data at all. The results were 0.59 (with 234 features) for arousal and 0.61 (with 340 features) for valence. The best results were always achieved with an RBF kernel [32].
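R² can be computed directly from the residual and total sums of squares; a minimal implementation:

```python
def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    # 1 means a perfect fit; 0 matches always predicting the mean;
    # negative values mean the model fits worse than the mean.
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot
```

For instance, perfect predictions give R² = 1.0, while constantly predicting the mean of the targets gives R² = 0.0.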
Yang [11] made an analogous study using a dataset with 195 songs (using only the audio), achieving an R² score of 0.58 for arousal and 0.28 for valence. We obtained almost the same result for arousal (0.59 vs. 0.58) and much better results for valence (0.61 vs. 0.28). Although a direct comparison is not possible, these results suggest that lyric analysis is likely to improve audio-only valence estimation. Thus, in the near future, we will evaluate a bi-modal analysis using both audio and lyrics.
In addition, we used the obtained arousal and valence regressors to perform regression-based classification (discussed below).

Classification Analysis
We conduct three types of experiments for each of the defined feature sets: i) classification by quadrant categories; ii) classification by arousal hemispheres; and iii) classification by valence meridians.

Classification By Quadrant Emotion Categories
Table 3 shows the performance of the best models for each of the feature categories (e.g., CBF). For CBF, for example, we considered the two best models (M11 and M12). The field #Features-SelFeatures-FMeasure(%) represents, respectively, the total number of features, the number of selected features, and the result accomplished via the F-measure metric after feature selection. In Table 3, M1x stands for models that employ CBF features, M2x represents models with StyBF features, M3x StruBF features and M4x SemBF features. The same code is employed in the tables in the following sections.
The model M41 is not significantly better than M11, but it is significantly better than M42 (at p < 0.05). For statistical significance, we use the Wilcoxon rank-sum test.
As we can see, the two best results were achieved with state-of-the-art features, namely BOW and LIWC, and were close to that of the novel semantic features in M42 (65.3%). The results of the other novel features (M22 and M31) were not as good in comparison to the baseline, at least when evaluated in isolation. Table 4 shows the results of the combination of the best models for each of the feature categories. For example, C1Q is the combination of the CBF best models after feature selection, i.e., initially, for this category, we have 10 different models (see Section 3.2.5). After feature selection, the models are combined (only the selected features), and the result is C1Q. C1Q has 900 features and, after feature selection, achieves an F-measure of 69.9%. The process is analogous for the other categories.
In Table 4, #Features represents the total number of features of the model, Selected Features the number of selected features, and F-measure the result accomplished via the F-measure metric. As we can see, the combination of the best BOW models (baseline) keeps the results close to 70% (model C1Q), with a high number of selected features (812). The results of the SemBF (C4Q) are significantly better, since we obtain a higher performance (76.20%) with far fewer features (39). The novel features (M42) seem to play an important role in the overall improvement of the SemBF, since the overall result for this type of features is 76.20%, while the best semantic model (LIWC) achieved 71.10%.
The mixed classifier (80.1%) is significantly better than the best classifiers by type of feature: C1Q, C2Q, C3Q and C4Q (at p < 0.05). These results show the importance of the new features for the overall results.
Additionally, we performed regression-based classification based on the above regression analysis. An F-measure of 76.1% was achieved, which is close to the quadrant-based classification. Hence, training only two regressors allows both the regression and classification problems to be addressed with reasonable accuracy.
Finally, we trained a model on the 180-lyrics dataset using the mixed C1Q+C2Q+C3Q+C4Q features, and validated it on the new, larger dataset comprising 771 lyrics (http://mir.dei.uc.pt/resources/Dataset-Allmusic-771Lyrics.zip). We obtained a 73.6% F-measure, which shows that our model, trained on the 180-lyrics dataset, generalizes reasonably well. The parameters used for the SVM classifier with polynomial kernel were 2 for the complexity parameter (C) and 0.6 for the exponent of the polynomial kernel.

Classification by Arousal Hemispheres
We perform the same study for the classification by arousal hemispheres. Table 5 shows the results attained by the best models for each feature set. The best result (83.90%) is obtained with trigrams after POS tagging (M12). This suggests that the way sentences are constructed, from a syntactic point of view, can be an important indicator of the arousal hemisphere of a lyric. The trigram vb+prp+nn is an example of an important feature for this problem (taken from this model's feature ranking). In this trigram, "vb" is a verb in the base form, "prp" is a personal pronoun and "nn" is a noun.
The novel features in StruBF (M31) and StyBF (M22) achieved, respectively, 70.2% with 8 features and 71.30% with 2 features. These results are above some state-of-the-art features, like those in M44, and are accomplished with few features (8 and 2, respectively). The results of the novel features in M42 seem promising, since they are close to the best model, M12, and comparable to known platforms like LIWC and GI, with fewer features (8, versus 50 and 70 for LIWC and GI, respectively). The model M12 is significantly better than the other classifiers (at p < 0.05). Table 6 shows the combinations by feature set and the combination of those combinations. Compared to the best state-of-the-art features (BOW), the best results were improved from 82.7% to 88.3%. The mixed classifier (88.3%) is significantly better than the best classifiers by type of feature: C1A, C2A, C3A and C4A (at p < 0.05), showing again the key role of the novel features.

Classification by Valence Meridians
We perform the same study for the classification by valence meridians. Table 7 shows the results of the best models by type of feature. These results show the importance of the semantic features in general, since the semantic models (M41, M42, M43) are significantly better than the classifiers of the other feature types (at p < 0.05). Features related to the positivity or negativity of words, such as VinDAL or posemo (positive words), play an important role in these results. Table 8 shows the combinations by feature set and the combination of those combinations. In comparison to the previous studies (quadrants and arousal), these results are better in general. This is visible in the BOW experiments (baseline, 85.6%), which already achieve a performance close to the best combination (C4V). The best results are also generally achieved with fewer features, as can be seen in C3V and C4V.
The mixed classifier (90%) is significantly better than the best classifiers by type of feature: C1V, C2V, C3V and C4V (at p < 0.05).

Binary Classification
As a complement to the multiclass problem above, we also evaluated a binary classification (BC) approach for each emotion category (e.g., quadrant 1), where the negative examples of a category are the lyrics tagged with the other categories. The results in Table 9 were obtained using 396, 442, 290 and 696 features, respectively, for the four sets of emotions (quadrants). The good performance of these classifiers, namely for quadrant 2, indicates that the prediction models can capture the most important features of these quadrants.
The analysis of the most important features by quadrant will be the starting point for the identification of the best features by sets of emotions or quadrants, as detailed in section 4.4.

New Features: Comparison to Baseline
Considering CBF as the baseline in this area, we thought it would be important to assess the performance of the models created by adding the newly proposed features to the baseline. The new features fall into three categories: StyBF (feature set M22), StruBF (feature set M31) and SemBF (feature set M42). We created new models by adding each of these feature sets to C1* in the following way: C1*+M22; C1*+M31; C1*+M42; C1*+M22+M31+M42. In C1*, 'C1' denotes the combination of the best Content-Based Features (the baseline), where '1' stands for CBF, as mentioned above; '*' is a placeholder for the experiment at hand: Q denotes classification by quadrants, A by arousal hemispheres and V by valence meridians. These models were created for each of the three classification problems from the previous section: classification by quadrants (see Table 10), by arousal (see Table 11) and by valence (see Table 12).

The baseline model (C1Q) alone reached 69.9% with 812 selected features (Table 4). All the combinations improve the results, but only the models C1Q+M42 and C1Q+M22+M31+M42 are significantly better than the baseline model (at p < 0.05). Moreover, C1Q+M22+M31+M42 is significantly better (at p < 0.05) than C1Q+M42, showing that the inclusion of StruBF and StyBF improved the overall results.

The baseline model (C1A) alone reached an F-measure of 82.7% with 1098 selected features (Table 6). Again, all the combinations improve the results, but only the models C1A+M42 and C1A+M22+M31+M42 are significantly better than the baseline model (at p < 0.05). The inclusion of the features from M22 and M31 in C1A+M22+M31+M42 improved the performance in comparison to C1A+M42, since C1A+M22+M31+M42 is significantly better than C1A+M42 (at p < 0.05).
The baseline model (C1V) alone reached an F-measure of 85.6% with 750 selected features (Table 8). All the combinations improve the results, but only the models C1V+M42 and C1V+M22+M31+M42 are significantly better than the baseline model (at p < 0.05); however, C1V+M22+M31+M42 is not significantly better than C1V+M42. This suggests the importance of the SemBF for this task in comparison to the other new features.
In general, the new StyBF and StruBF are not enough to improve the baseline score significantly; however, they achieve the same results with far fewer features: for classification by quadrants, the number of features of the model decreases from 812 (baseline) to 384 (StyBF) and 466 (StruBF). The same happens for arousal classification (from 1098 features in the baseline to 652 with StyBF and 373 with StruBF) and for valence classification (from 750 to 679 with StyBF and 659 with StruBF). Moreover, the model with all the features is always better (except for valence classification) than the model with only the baseline and SemBF, which shows the relative importance of the novel StyBF and StruBF. It is worth highlighting that M22 has only 3 features and M31 has 12 features.
The new SemBF (model M42) seem important because they clearly improve the baseline score. In particular, in the last problem (classification by valence), they require far fewer features (750 down to 88).

Best Features by Classification Problem
In the previous section we determined the classification models with the best performance for the several classification problems. These models were built from the interaction of a set of features (selected from the full feature set after feature selection). Some of these features may be strong predictors of a class on their own, while others are strong only when combined with other features.
Our purpose in this section is to identify the most important features, when they act alone, for the description and discrimination of the problem's classes.
We will determine the best features for:
- Arousal (hemispheres) description: negative arousal (AN) vs. positive arousal (AP)
- Valence (meridians) description: negative valence (VN) vs. positive valence (VP)
- Arousal when valence is positive: AN vs. AP, i.e., quadrant 1 vs. quadrant 4
- Arousal when valence is negative: AN vs. AP, i.e., quadrant 2 vs. quadrant 3
- Valence when arousal is positive: VN vs. VP, i.e., quadrant 1 vs. quadrant 2
- Valence when arousal is negative: VN vs. VP, i.e., quadrant 3 vs. quadrant 4

In all the situations we identify the 5 features that, after analysis, seem the best. This analysis starts from the rankings (top 20) of the best features extracted with ReliefF from the models of Section 4.2. Next, to validate ReliefF's ranking, we compute the probability density functions (pdf) [31] for each class of the previous problems. Through the analysis of these pdfs we draw some conclusions about the description of the classes and identify some of their main characteristics.
Figure 4 shows the pdfs of 2 of the 5 best features for the problem of valence description when arousal is positive (distinguishing between the 1st and 2nd quadrants): M44-Anger_Weight_Synesketch (a) and M42-DinANEW (b). As we can see, the feature in (a) is more important for discriminating between the 1st and 2nd quadrants than the feature in (b), because its density functions are more separated. We use a measure, defined in (2), that quantifies this separation: Intersection_Area, the intersection area (in percentage) between the two functions.
In (2), A and B are the compared classes (VN and VP in the example of Figure 4), and f_A and f_B are their respective pdfs.
For this measure, lower values indicate more separation between the curves.
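Since equation (2) is not reproduced here, the following is one plausible reading of the measure, assuming the intersection area is the area under the pointwise minimum of the two densities (0 when the curves are fully separated, 1, i.e., 100%, when they coincide); the Gaussian pdfs are illustrative stand-ins for the estimated feature densities:

```python
import math

def gaussian_pdf(mu, sigma):
    """Illustrative parametric pdf; in practice the pdfs are estimated
    from the feature values of each class."""
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def intersection_area(f_a, f_b, lo, hi, n=20_000):
    """Trapezoidal-rule approximation of the area under min(f_A, f_B)
    over [lo, hi], i.e., the overlap between the two densities."""
    h = (hi - lo) / n
    ys = [min(f_a(lo + i * h), f_b(lo + i * h)) for i in range(n + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

# Identical pdfs overlap completely; distant pdfs barely overlap.
full = intersection_area(gaussian_pdf(0, 1), gaussian_pdf(0, 1), -8, 8)
tiny = intersection_area(gaussian_pdf(-4, 1), gaussian_pdf(4, 1), -12, 12)
```

Ranking features by this overlap (ascending) reproduces the "more separated curves are better" criterion used in the text.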
Both features are important for describing the quadrants. The first, taken from the Synesketch framework, measures the weight of anger in the lyrics and, as we can see, it has higher values for the 2nd quadrant, as expected, since anger is a typical emotion of that quadrant. The second represents the average dominance of the ANEW words in the lyrics and, despite some overlap, it shows that higher values predominantly indicate the 1st quadrant and lower values the 2nd quadrant.
Based on the above metric, the top-5 features were identified for each problem, i.e., the features that best separate the classes of each problem.

Best Features for Arousal Description
As we can see (Table 13), the two best features for discriminating between arousal hemispheres are novel features proposed by us. FCL represents the number of words starting with a capital letter and describes the class AP better than the class AN, i.e., lyrics with FCL greater than a specific value normally belong to the class AP, while for low values the two classes are mixed. The same happens with #Slang, #Title, WC (word count, LIWC), active (words with active orientation, GI) and vb (number of verbs in the base form). The feature negate (number of negations, LIWC) has the opposite behavior, i.e., the classes are mixed for lower values and the class AN dominates above a specific point. Among the features not listed above, sad (words of the negative emotion sadness, LIWC), angry (anger weight in ConceptNet) and numb (words indicating the assessment of quantity, including the use of numbers, GI) behave similarly to negate, while the novel features CH (number of repetitions of the chorus) and TotalVorCH (number of repetitions of verses or chorus) behave similarly to FCL.
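Some of these stylistic and structural counts are simple to compute from raw lyric text. A minimal sketch, in which the tokenization and the small slang lexicon are assumptions (the paper does not list its slang dictionary):

```python
import re

# Illustrative slang lexicon; the paper's actual slang dictionary is not listed.
SLANG = {"gonna", "wanna", "gotta", "ain't", "yeah"}

def stylistic_counts(lyric, title):
    """Three of the counts discussed above: FCL (words starting with a
    capital letter), #Slang (slang-word occurrences) and #Title
    (occurrences of the song title in the lyric)."""
    words = re.findall(r"[A-Za-z']+", lyric)
    return {
        "FCL": sum(1 for w in words if w[0].isupper()),
        "#Slang": sum(1 for w in words if w.lower() in SLANG),
        "#Title": lyric.lower().count(title.lower()),
    }
```

Structural counts such as CH (chorus repetitions) would additionally require segmenting the lyric into verse/chorus blocks, which is not shown here.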

Best Features for Valence Description
The best features for valence description are presented in Table 15.

Best Features for Arousal when Valence is Negative
These features are summarized in Table 16. The features Anger_Weight_Synesketch and Disgust_Weight_Synesketch (weight of the emotion disgust) are good at discriminating between quadrants 2 and 3 (higher values are associated, as expected, with instances from quadrant 2), although the latter shows more overlap between the classes than the former. The features vbp (verb, non-3rd person singular present) and anger can discriminate the class AP (higher values), but for lower values the classes are mixed. Other features with similar behavior are FCL, #Slang, negativ (negative words, GI), cc (number of coordinating conjunctions) and #Title. AinGAZQ2 and past can discriminate the 3rd quadrant, i.e., the class AN. Finally, the feature article (the number of definite, e.g., the, and indefinite, e.g., a, an, articles in the text) can discriminate both quadrants (a tendency for the 3rd quadrant with lower values and the 2nd quadrant with higher values).

Best Features for Valence when Arousal is Positive
The feature Anger_Weight_Synesketch is clearly discriminative for separating quadrants 1 and 2 (see Table 17 and Figure 4). The novel semantic features VinANEW, VinGAZQ1Q2Q3Q4, VinDAL and DinANEW have a similar pattern of behavior to the first feature, but with a little overlap between the functions. The features negemo (negative emotion words, LIWC), swear (swear words, LIWC), negativ (words of negative outlook, GI) and hostile (words indicating an attitude or concern with hostility or aggressiveness, GI) are good for discriminating the 2nd quadrant (higher values).

Best Features for Valence when Arousal is Negative
The best features for valence discrimination when arousal is negative are presented in Table 18. Between quadrants 3 and 4, the features vbd, I, self and motion are better at discriminating the 3rd quadrant, while the features #GAZQ4, article, cc and posemo are better at discriminating the 4th quadrant.

Best Features by Quadrant
Until now, we have identified features that are important to discriminate, for example, between two quadrants. Next, we evaluate whether these features can fully discriminate the four quadrants, i.e., one quadrant against the other three.
To evaluate the quality of the discrimination of a specific feature concerning a quadrant Qz, we established a metric based on two measures:
- Discrimination support (the support of a function is the set of points where the function is not zero-valued [33]), which corresponds to the difference between the total support of the two pdfs (Qz and Qothers) and the support of the Qothers pdf, as defined in (3). The result is the support of the Qz pdf excluding the support of the intersection area, expressed as a percentage of the total support. The higher this measure, the better.

Among the features that best represent each quadrant, we have state-of-the-art features, such as, from LIWC (M41): humans (references to humans), anger (affect words), negemo (negative emotion words), WC (word count), negate (negations), cogmech (cognitive processes), Dic (dictionary words) and hear (hearing perceptual process); from GI (M43): socrel (words for socially-defined interpersonal processes), solve (words referring to the mental processes associated with problem solving), passive (words indicating a passive orientation), negativ (negative words) and hostile (words indicating an attitude or concern with hostility or aggressiveness); from ConceptNet (M44): happy_CN (happy weight) and CN_A (arousal weight); from POS tags (M21): vbp (verb, non-3rd person singular present), vbg (verb, gerund or present participle), nn (noun, singular or mass), dt (determiner), cc (coordinating conjunction) and prp (personal pronoun). We also have novel features: from StyBF (M22), #Slang and FCL; from StruBF (M31), #Title and TotalVorCH; from SemBF (M42), #GAZQ1, #GAZQ3, VinGAZQ1Q2Q3Q4, #GAZQ4 and DinDAL.
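Since equation (3) is not reproduced here, the sketch below encodes one plausible reading of the discrimination-support measure: the portion of the combined support where only the Qz pdf is non-zero (above a small threshold), as a percentage of the total support. The Gaussian pdfs are illustrative stand-ins for the estimated class densities:

```python
import math

def gaussian_pdf(mu, sigma):
    """Illustrative parametric pdf standing in for an estimated density."""
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def discrimination_support(f_qz, f_others, lo, hi, n=20_000, eps=1e-6):
    """Percentage of the combined support of the two pdfs where only
    the Qz pdf is non-zero (values above `eps` count as non-zero)."""
    h = (hi - lo) / n
    total = only_qz = 0
    for i in range(n + 1):
        x = lo + i * h
        in_qz = f_qz(x) > eps
        in_others = f_others(x) > eps
        total += in_qz or in_others
        only_qz += in_qz and not in_others
    return 100.0 * only_qz / total if total else 0.0
```

Two fully separated densities of equal width score near 50% (each contributes half of the combined support), while identical densities score 0%, matching the "higher is better" reading in the text.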
Some of the most salient characteristics of each quadrant:
- Q1: typically lyrics associated with songs with positive emotions and high activation. Songs from this quadrant are often associated with specific musical genres, such as dance and pop; from the importance of the features, we point out those related to repetitions of the chorus and of the title in the lyric.
- Q2: we point out stylistic features such as #Slang and FCL, which indicate high activation with a predominance of negative emotions, as well as features related to negative valence, such as negativ (negative words), hostile (hostile words) and swear (swear words). These features influence Q2 more than Q3 (although Q3 also has negative valence) because Q2 is more influenced by specific vocabulary, such as the vocabulary captured by those features, while Q3 is more influenced by negative ideas; hence, we think the perception of emotions is more difficult in the 3rd quadrant.
- Q3: we point out the importance of the past tense, in contrast with the other quadrants, where the present tense predominates. Q2 also shows some tendency toward the gerund, and Q1 toward the simple present. We also highlight, in comparison with the other quadrants, greater use of the 1st person singular (I).
- Q4: features related to activation, as seen for quadrants 1 and 2, have low weight for this quadrant. We point out the importance of specific vocabulary, as captured by #GAZQ4.
Generally, semantic features are more important for discriminating valence (e.g., VinDAL, VinANEW). Features that matter in sentiment analysis, such as posemo (positive words) or ngtv (negative words), are also important for valence discrimination.
On the other hand, stylistic features related to the activation of the written text, such as #Slang or FCL, are important for arousal discrimination, as are features related to the weight of emotions in the written text (e.g., Anger_Weight_Synesketch, Disgust_Weight_Synesketch).

Interpretability
Having studied the best features for describing and discriminating each set of emotions, we now extract rules that help us understand how these features and emotions are related. With this study we intend to attain two goals: i) find relations between features and emotions (e.g., if feature A is low and feature B is high, then the song lyrics belong to quadrant 2); ii) find relations among features (e.g., song lyrics with a high value of feature A also have a low value of feature B).

Relations between features and quadrants
In this analysis we use the Apriori algorithm [34].
First, we pre-processed the employed features by detecting those with a nearly uniform distribution, i.e., features whose values depart at most 10% from their mean value; these features were discarded. We employed all the features selected in the mixed C1Q+C2Q+C3Q+C4Q model (see Table 4), except for the ones excluded as described.
In total, we employed 144 features.
Then we defined the following premises:
- Only rules with up to 2 antecedents were considered. An algorithm was applied to eliminate redundancy, keeping the more generic rules and avoiding complex ones;
- Since n-gram features are sparse, we did not consider rules whose antecedent contains a condition of the type n-gram = Very Low, which most likely means that the feature simply does not occur;
- Features were discretized into 5 classes using equal-frequency discretization: very low (VL), low (L), medium (M), high (H), very high (VH). Rules containing non-uniformly distributed features were ignored.

We considered two measures to assess the quality of the rules: confidence and support. The ideal rule has simultaneously high representativity (support) and a high confidence degree. Table 20 shows the best rules for quadrants. We defined thresholds of support = 8.3% (15 lyrics) and confidence = 60%.
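The discretization step and the two rule-quality measures can be sketched as follows (a minimal illustration of the pipeline's building blocks, not the Apriori implementation itself; the row layout is an assumption):

```python
def equal_freq_discretize(values, labels=("VL", "L", "M", "H", "VH")):
    """Equal-frequency discretization of a feature into 5 ordered classes:
    sort the values, then assign equally sized rank bins."""
    order = sorted(range(len(values)), key=values.__getitem__)
    k, n = len(labels), len(values)
    out = [None] * n
    for rank, idx in enumerate(order):
        out[idx] = labels[min(rank * k // n, k - 1)]
    return out

def rule_quality(rows, antecedent, quadrant):
    """Support and confidence (both in %) of a rule 'antecedent -> quadrant',
    where each row is a dict of discretized feature values plus a 'quadrant' key."""
    matches = [r for r in rows if all(r.get(f) == v for f, v in antecedent.items())]
    hits = [r for r in matches if r["quadrant"] == quadrant]
    support = 100.0 * len(hits) / len(rows)
    confidence = 100.0 * len(hits) / len(matches) if matches else 0.0
    return support, confidence
```

A rule would then be kept only if its support and confidence clear the thresholds given above (support >= 8.3%, confidence >= 60%).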
We believe these rules are, in general, self-explanatory; nevertheless, we explain the less obvious ones below.
For Q1 we can see the importance of the feature #GAZQ1 together with the GI feature afftot (words in the affect domain), both with VH values. We can also highlight, for this quadrant, the relation between a VL weight for sadness and a VH value for the feature positiv (words of positive outlook), and the relation between a VH number of repetitions of the title in the lyric and a VL weight for the emotion angry.
For quadrant 2, we can point out the importance of the features anger (from LIWC and Synesketch), negemo_GI (negative emotion), #GAZQ2, VinANEW, hostile (words indicating an attitude or concern with hostility or aggressiveness) and powcon (words for ways of conflicting), and of some combinations among them.
For quadrant 3, we can point out the relation between a VH value for the emotion sadness and a VL value for the number of swear words in the lyrics.
For quadrant 4 we can point out the relation between the features anger and weak (words implying weakness) both with VL values.
These results confirm the results reached in the previous section, where we identified the most important features for each quadrant.