Computational author discrimination can be used to determine whether two texts were written by the same author or by different authors.
Textual scholars use this approach to test the authorship of books attributed to a particular author. For example, New Testament scholars analyse the letters attributed to Paul in terms of their vocabulary, style, structure, patterns and mindset, along with a number of other features. They conclude that not all of the letters can have been written by the same author, because the linguistic styles of the various letters are dramatically different.
In this answer, I am going to consider two important studies of the Qur'an and hadith, both by Halim Sayoud of USTHB University, Algiers. The first focuses on linguistic analyses and the second on visual analyses.
In 2012 the first study was published in the journal Literary and Linguistic Computing. It used computational author discrimination to test whether the Qur'an was authored by Muhammad (saw).
The study comprised sixteen experiments comparing expressions, content (e.g. citation of animals), variant common words and characters, word frequencies, citation of numbers, special ending bigrams, similar vocabulary, and other features of the narrated hadiths (Bukhari was used) and the Qur'an.
They found dramatic differences between the two books, and internal consistency within each book, which led to the conclusion 'the two books must have two different authors; i.e. the Prophet (saw) can't be the author of the Qur'an.'
The study commented:
"Thus, three series of experiments are done and commented on.
The first series of experiments analyses the two books in a global form ... It concerns nine different experiments.
The second series of experiments analyses the two books in a segmental form (four different segments of text are extracted from every book). It concerns five different experiments.
The third series of experiments makes an automatic authorship attribution of the two books in a segmental form by employing several classifiers and several types of features. The sizes of the segments are more or less in the same range (four different text segments ...). It concerns two different experiments."
Discussion of Experiments
In the global analyses, the study performed nine experiments. One of them used "word frequency-based analysis". A discriminative word is a word that is frequently used in one text and rarely used in the other, so it can serve as a sample word for discriminating between the two texts.
This experiment found that words used frequently in the Qur'an were not used in the Hadith and vice versa, so the two books adopt different vocabulary styles.
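To make the idea of a discriminative word concrete, here is a minimal sketch of how I would look for one. It is not the study's code: the function names, the two thresholds `min_freq` and `max_other`, and the toy English texts are all my own assumptions for illustration.

```python
from collections import Counter

def relative_freqs(text):
    """Return each word's share of all word tokens in the text."""
    words = text.lower().split()
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def discriminative_words(text_a, text_b, min_freq=0.15, max_other=0.0):
    """Words frequent in text_a (>= min_freq) and rare in text_b (<= max_other).

    Both thresholds are illustrative assumptions, not values from the study.
    """
    fa, fb = relative_freqs(text_a), relative_freqs(text_b)
    return sorted(w for w, f in fa.items()
                  if f >= min_freq and fb.get(w, 0.0) <= max_other)

# Toy texts (hypothetical, English only for readability).
text_a = "verily the night verily the day verily the sun"
text_b = "he said the man came and he said the woman left"

print(discriminative_words(text_a, text_b))  # ['verily']
print(discriminative_words(text_b, text_a))  # ['he', 'said']
```

Note that "the" is frequent in both toy texts, so it discriminates neither, which is the point of requiring the word to be rare in the other text.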
Another experiment deals with words that are present in one book and absent in the other. The results of this experiment show:
"62% of the Bukhari hadith words are untraceable in the Quran and 83% of the Quran words are untraceable in the Bukhari Hadith … Practically, it is impossible for a same author to write two books (related to a similar topic) with a so great difference in the vocabulary. Therefore, we can deduce that the two books should come from two authors who are characterized by two different vocabularies."
For the style experiment, the COST parameter is used, which measures the termination similarity between neighbouring sentences of a given text, such as a shared final syllable or letter. According to this analysis:
"We remark that for the Hadith mixture, there are many COST values equal to zero; and when the COST is non-null, it has very small values: the average COST is only 0.46. For the Quran, we notice that the COST is almost never null, and the corresponding values are relatively high: the average COST of the Quran is approximately 2.52. This fact means that the structure of the Quran is very different from the Hadith one."
Character frequency-based analysis is another experiment comparing the character frequencies of the two books. The analysis found two different writing styles for the two books.
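A character frequency comparison could look roughly like this. The distance measure (sum of absolute differences between the two frequency distributions) is my own choice for illustration; I am not claiming it is the one used in the study, and the function names are hypothetical.

```python
from collections import Counter

def char_freqs(text):
    """Relative frequency of each alphabetic character in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {c: n / total for c, n in Counter(letters).items()}

def freq_distance(text_a, text_b):
    """Sum of absolute differences between the two character distributions.

    0.0 means identical distributions; 2.0 means no characters in common.
    """
    fa, fb = char_freqs(text_a), char_freqs(text_b)
    return sum(abs(fa.get(c, 0.0) - fb.get(c, 0.0)) for c in set(fa) | set(fb))

print(freq_distance("abc", "cab"))  # 0.0
print(freq_distance("aaa", "bbb"))  # 2.0
```

Because `str.isalpha` also accepts Arabic letters, the same sketch can be run on Arabic text without changes.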
In the experiment on the citation of numbers, the most frequently cited number in the Quran was '1', whereas in the hadiths it was '3'. Both books use more odd numbers than even ones, except for the Quran's usage of the number '5'.
The animal names quoted in the Quran, twenty-nine in total, are completely absent from the hadith.
"We remark that several animal names are not cited in the Bukhari Hadith and particularly the name عجل (calf), which is cited ten times in the Quran and which is completely absent in the Bukhari Hadith … we quote the animals that are quoted in the Bukhari Hadith but completely absent in the Quran. There are eleven such animal names. A particular observation can be done about the name (sheep), which is cited ten times in the Bukhari Hadith and which is completely absent in the Quran."
Another experiment deals with special ending bigrams that are often used in Arabic. It found a significant difference between the two books: the frequency of these ending bigrams was relatively high in the Quran and relatively low in the hadith.
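An ending bigram is simply the last two characters of a word, so its frequency is easy to measure. The sketch below is my own illustration: the function name is hypothetical, and the English bigrams stand in for the Arabic ones the study examined, which I do not list here.

```python
from collections import Counter

def ending_bigram_rates(text, bigrams):
    """Share of words (length >= 2) that end with each of the given bigrams."""
    words = [w for w in text.lower().split() if len(w) >= 2]
    endings = Counter(w[-2:] for w in words)
    total = len(words)
    return {b: (endings[b] / total if total else 0.0) for b in bigrams}

# Toy text and placeholder bigrams (hypothetical).
sample = "walking talking to the singing crowd"
print(ending_bigram_rates(sample, ["ng", "ed"]))  # {'ng': 0.5, 'ed': 0.0}
```

Running the same function on two texts with the same list of bigrams gives rates that can be compared directly. Punctuation is not stripped in this sketch, so real text would need cleaning first.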
A powerful experiment in the series of segmental analyses was 'vocabulary-based similarity'. The study found the intra-similarities (between segments of the same book) to be high, between 26% and 31%, while the inter-similarities (between segments from different books) were relatively low, not exceeding 20%. The study observed:
"the four segments of the Quran have a great similarity in vocabulary and the four segments of the Hadith have a great similarity in vocabulary, too. On the other hand, it implies a low similarity between the vocabulary styles of the two different books. … This observation shows that all the segments of the same book appear to have a unique origin and that the two books should have two different author styles."
The third series of experiments consisted of automatic authorship attribution, where the author employed software to automatically classify eight text segments using different features and classifiers. Statistically, the four Qur'anic segments belong to one author, the four hadith segments belong to a second author, and the two authors are likely to be different.
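The classifiers and features used in the study are not named in this answer, so the following is only a generic sketch of automatic attribution, under my own assumptions: character bigram counts as features, cosine similarity as the measure, and a leave-one-out nearest-neighbour rule as the classifier. All names and the toy segments are hypothetical.

```python
from collections import Counter
from math import sqrt

def char_bigrams(text):
    """Count overlapping character bigrams in the text."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def leave_one_out(segments):
    """Label each segment with the label of its most similar OTHER segment."""
    feats = [(label, char_bigrams(text)) for label, text in segments]
    predictions = []
    for i, (_, fi) in enumerate(feats):
        best = max((j for j in range(len(feats)) if j != i),
                   key=lambda j: cosine(fi, feats[j][1]))
        predictions.append(feats[best][0])
    return predictions

# Toy labelled segments (hypothetical).
segments = [
    ("A", "the night and the light"),
    ("A", "the light of the night"),
    ("B", "he said that she said"),
    ("B", "she said what he said"),
]
print(leave_one_out(segments))  # ['A', 'A', 'B', 'B']
```

If every held-out segment is assigned back to its own book, as in this toy run, the segments of each book are stylistically closer to each other than to the other book.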
The importance of the last two series of experiments is that 'the whole of each book (Qur'an and Hadith) must have been authored by one author.' This means the Qur'an has not been corrupted by others and the hadith is not a later fabrication by some scholars.
This study concluded:
"Consequently, we can conclude, according to this investigation, that the Quran was not written by the Prophet Muhammad and that it belongs to a unique author too. Muslims believe that it is written by Allah (God) and sent to his messenger (the prophet Muhammad)."
In 2015, the second study was published in the proceedings of the 6th International Conference on Information Visualization Theory and Applications. It presented a visual analytics-based investigation of both the Qur'an and the Hadith.
For this purpose, two visual analytics clustering methods were employed, namely Hierarchical Clustering (a method of cluster analysis which seeks to build a hierarchy of clusters) and Fuzzy C-means Clustering (an automatic clustering technique in which each data point is allocated to clusters by degree of membership rather than to exactly one cluster). Several types of features were extracted from the texts.
The Qur'an and the Hadith were segmented into 25 text segments (14 for the Quran and 11 for the Hadith). If applying the above two methods resulted in only one cluster, this would mean that the different texts were probably written by a single author.
On the other hand, if several clusters emerged, with some Qur'anic texts grouped with hadith ones (in the same cluster), this would mean that some Qur'anic texts were probably written by the hadith author.
However, if two clusters appear in the clustering display and all the Qur'anic texts are grouped in one cluster and all the hadith texts are grouped in another distinct cluster, this will imply that the two books (Quran and Hadith) are written by two different authors.
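The three possible outcomes described above amount to a simple decision rule, which I sketch below. The function name, the input format (pairs of book name and cluster id) and the returned wording are my own assumptions for illustration.

```python
def interpret_clusters(assignments):
    """Map (book, cluster_id) pairs to one of the three outcomes described above."""
    clusters = {}
    for book, cluster_id in assignments:
        clusters.setdefault(cluster_id, set()).add(book)
    if len(clusters) == 1:
        return "one cluster: probably a single author"
    if any(len(books) > 1 for books in clusters.values()):
        return "mixed clusters: some texts probably share an author"
    return "separate clusters: probably different authors"

# Hypothetical cluster assignments for a handful of segments.
separate = [("quran", 0), ("quran", 0), ("hadith", 1), ("hadith", 1)]
mixed = [("quran", 0), ("quran", 1), ("hadith", 1), ("hadith", 1)]
single = [("quran", 0), ("hadith", 0)]

print(interpret_clusters(separate))  # separate clusters: probably different authors
```

Only the first of these toy inputs corresponds to the outcome in which every cluster contains segments from one book alone.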
The result of the hierarchical clustering shows two sharply separated clusters with no overlap between them. Thus, there are probably two authors: an author of the Qur'an and an author of the hadith.
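For readers unfamiliar with hierarchical clustering, here is a bare-bones agglomerative version. It is a sketch under my own assumptions: single linkage, Euclidean distance, and made-up two-dimensional feature vectors. The study's actual features and linkage rule are not specified in this answer.

```python
from math import dist

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the two closest clusters
    return [sorted(c) for c in clusters]

# Hypothetical feature vectors: segments 0-2 from one text, 3-5 from another.
points = [(0.9, 0.8), (1.0, 0.9), (0.8, 1.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.0)]
print(agglomerative(points, 2))  # [[0, 1, 2], [3, 4, 5]]
```

Stopping the merging at two clusters recovers the two groups of segments exactly, which is the kind of clean separation the study reports.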
The result of the Fuzzy C-means clustering shows two main clusters: a Qur'an cluster located at the top right area and a hadith cluster located at the bottom left area of the 3D representation.
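Fuzzy C-means differs from ordinary clustering in that every point receives a degree of membership in every cluster. The sketch below shows the standard update rules on made-up 3D feature vectors; the fuzzifier `m=2.0`, the fixed iteration count, the initial centres and the data are my own assumptions, not the study's settings.

```python
from math import dist

def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Basic fuzzy c-means; returns (centers, memberships).

    memberships[j][i] is the degree to which point j belongs to cluster i,
    and each row sums to 1. `centers` holds the initial centre guesses.
    """
    centers = [tuple(c) for c in centers]
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: closer centres receive a larger share.
        memberships = []
        for p in points:
            d = [max(dist(p, c), 1e-12) for c in centers]  # avoid division by zero
            memberships.append([
                1.0 / sum((d[i] / d[k]) ** power for k in range(len(centers)))
                for i in range(len(centers))])
        # Centre update: mean of all points, weighted by membership ** m.
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[axis] for w, p in zip(weights, points)) / total
                for axis in range(len(points[0]))))
        centers = new_centers
    return centers, memberships

# Hypothetical 3D feature vectors: segments 0-2 from one text, 3-5 from another.
points = [(0.9, 0.8, 0.9), (1.0, 0.9, 0.8), (0.8, 1.0, 1.0),
          (0.1, 0.2, 0.0), (0.2, 0.1, 0.1), (0.0, 0.0, 0.2)]
centers, memberships = fuzzy_c_means(points, [points[0], points[3]])
for row in memberships:
    print([round(u, 3) for u in row])
```

When two groups are well separated, each point's membership in its own cluster is close to 1, so the fuzzy result looks almost like a hard partition.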
The two sets of text segments were automatically organised into two sharp clusters proving two separate authors: a Qur'an Author and a hadith author, both different in their linguistic style.
The linguistic and stylistic analyses show that the Qur'an and the hadith are two independent texts authored by two different authors, with different styles in their vocabulary, linguistic structures (bigrams, expressions, word frequency, etc.) and mindset.
Sayoud, H. (2012). Author discrimination between the Holy Quran and Prophet's statements. Literary and Linguistic Computing, 27(4), 427-444. DOI: 10.1093/llc/fqs014
Sayoud, H. (2015). A Visual Analytics based Investigation on the Authorship of the Holy Quran. In Proceedings of the 6th International Conference on Information Visualization Theory and Applications - Volume 1: IVAPP (VISIGRAPP 2015), pages 177-181. ISBN 978-989-758-088-8. DOI: 10.5220/0005355601770181