by Rui Nian

Introduction

Any real-life data set used for classification is most likely imbalanced: the event you are interested in is very rare (minority examples), while non-interesting events dominate the data set (majority examples). Because of this, machine learning models built to identify the rare cases will perform terribly.
An intuitive example: imagine classifying credit card fraud. If there are only 5 fraudulent transactions per 1,000,000 transactions, then all our model has to do is predict negative for every input, and it will be 99.9995% accurate! The model will therefore most likely learn to "predict negative" no matter what the input is, which makes it completely useless. To combat this problem, the data set must be balanced, with similar amounts of positive and negative examples.
Two traditional ways to solve this problem are under-sampling and over-sampling. In under-sampling, the majority class is down-sampled to the same number of examples as the minority class. However, this is extremely data-inefficient: the discarded data contains important information about the negative examples.
Imagine building a house-cat classifier with 1,000,000 images of different animal species, of which only 50 are cat images (positive examples). After down-sampling to roughly 50 negative images for a balanced data set, we may have deleted every picture of a tiger or lion in the original data set. Since tigers and lions look similar to house cats, the classifier will mistake them for house cats! We had examples of tigers and lions, but the model was never trained on them because they were deleted. To avoid this data inefficiency, over-sampling is used: the minority class is copied x times until its size is similar to the majority class. The greatest flaw here is that the model will overfit to the minority data, because the same examples appear so many times.
[Image from: Kaggle]

To avoid all of the above problems, ADASYN can be used! ADASYN (Adaptive Synthetic) is an algorithm that generates synthetic data. Its greatest advantages are that it does not copy the same minority data and that it generates more data for "harder-to-learn" examples. How does it work? Let's find out! Throughout the blog, I will also provide code for each part of the ADASYN algorithm.
The full code can be found here:
A link to the original paper can be found here.
ADASYN Algorithm

Step 1

Calculate the degree of class imbalance, i.e. the ratio of minority to majority examples:

d = mₛ / mₗ

where mₛ and mₗ are the number of minority and majority class examples, respectively. If d is lower than a preset threshold, initialize the algorithm.
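A minimal sketch of this step in Python, assuming the minority and majority examples are stored in NumPy arrays X_min and X_maj (the toy data, variable names, and threshold value are illustrative, not from the original post):

```python
import numpy as np

# Hypothetical toy data: 900 majority vs. 100 minority examples in 2D.
rng = np.random.default_rng(0)
X_maj = rng.normal(loc=0.0, scale=1.0, size=(900, 2))
X_min = rng.normal(loc=2.0, scale=0.5, size=(100, 2))

m_s, m_l = len(X_min), len(X_maj)   # number of minority / majority examples
d = m_s / m_l                       # degree of class imbalance, d in (0, 1]
d_threshold = 0.5                   # preset threshold (an assumed value)

if d < d_threshold:
    print(f"d = {d:.3f} is below {d_threshold}, so run ADASYN")
```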
Step 2

Calculate the total number of synthetic minority examples to generate:

G = (mₗ − mₛ) × β

Here, G is the total number of synthetic minority examples to generate, and β is the desired ratio of minority to majority data after ADASYN. β = 1 means a perfectly balanced data set after ADASYN.
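Continuing the same sketch, with β = 1 for a perfectly balanced result:

```python
beta = 1.0                       # desired minority:majority ratio after ADASYN
G = int((m_l - m_s) * beta)      # total number of synthetic minority examples
```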
Step 3

Find the K nearest neighbours of each minority example and calculate its rᵢ value:

rᵢ = Δᵢ / K

where Δᵢ is the number of majority class examples among the K nearest neighbours of minority example xᵢ. After this step, each minority example is associated with its own neighbourhood. The rᵢ value indicates the dominance of the majority class in that neighbourhood: higher-rᵢ neighbourhoods contain more majority class examples and are more difficult to learn. In the example below, K = 5 (we look for the 5 nearest neighbours).
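A sketch of this step, assuming scikit-learn's NearestNeighbors is used for the neighbour search (the stacked X_all and y_all arrays are illustrative names, continuing the toy example above):

```python
from sklearn.neighbors import NearestNeighbors

K = 5
X_all = np.vstack([X_min, X_maj])
y_all = np.hstack([np.ones(m_s), np.zeros(m_l)])   # 1 = minority, 0 = majority

# K + 1 neighbours because each minority point is its own nearest neighbour.
nn_all = NearestNeighbors(n_neighbors=K + 1).fit(X_all)
_, neigh_idx = nn_all.kneighbors(X_min)
neigh_idx = neigh_idx[:, 1:]                       # drop the point itself

# r_i = (number of majority neighbours) / K for each minority example.
r = (y_all[neigh_idx] == 0).sum(axis=1) / K
```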
Step 4

Normalize the rᵢ values so that they sum to 1:

r̂ᵢ = rᵢ / Σᵢ rᵢ

This step is mainly a precursor that makes Step 5 easier.
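In the sketch, this is a one-line normalization (assuming at least one rᵢ is non-zero):

```python
r_hat = r / r.sum()      # normalized densities; r_hat now sums to 1
```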
Step 5

Calculate the number of synthetic examples to generate for each neighbourhood:

Gᵢ = r̂ᵢ × G

Because rᵢ is higher for neighbourhoods dominated by majority class examples, more synthetic minority examples are generated for those neighbourhoods. This is what gives ADASYN its adaptive nature: more data is generated for "harder-to-learn" neighbourhoods.
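Continuing the sketch, rounding to whole numbers of examples:

```python
g = np.rint(r_hat * G).astype(int)   # G_i: synthetic examples per neighbourhood
```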
Step 6

Generate Gᵢ synthetic examples for each neighbourhood. First, take the minority example of the neighbourhood, xᵢ. Then, randomly select another minority example within that neighbourhood, xzᵢ. The new synthetic example is:

sᵢ = xᵢ + λ (xzᵢ − xᵢ)

where λ is a random number between 0 and 1, sᵢ is the new synthetic example, and xᵢ and xzᵢ are two minority examples within the same neighbourhood. Intuitively, each synthetic example lies on the line segment between xᵢ and xzᵢ, i.e. it is a linear combination of the two. The sketch below shows this step in code.
White noise can be added to the synthetic examples to make the new data even more realistic. Also, instead of linear interpolation, planes can be drawn between 3 minority examples, and points can be generated on the plane instead.
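A sketch of the generation loop. To keep it simple, this version picks xzᵢ from the K nearest minority neighbours of xᵢ rather than from the mixed neighbourhood of Step 3; that is a deliberate simplification of mine, not necessarily what the original post's code did:

```python
# K nearest *minority* neighbours of each minority example, used to pick x_zi.
nn_min = NearestNeighbors(n_neighbors=min(K + 1, m_s)).fit(X_min)
_, min_idx = nn_min.kneighbors(X_min)
min_idx = min_idx[:, 1:]                        # drop the point itself

synthetic = []
for i, g_i in enumerate(g):
    for _ in range(g_i):
        z = rng.choice(min_idx[i])              # random minority neighbour x_zi
        lam = rng.random()                      # lambda drawn uniformly from [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[z] - X_min[i]))

X_syn = np.asarray(synthetic)
print(f"Generated {len(X_syn)} synthetic minority examples (target was {G})")
```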
And that's it! With the above steps, any imbalanced data set can be balanced, and models built on the new data set should be much more effective.
Weaknesses of ADASYN

There are two major weaknesses of ADASYN:

1. For minority examples that are sparsely distributed, each neighbourhood may contain only 1 minority example.
2. The precision of ADASYN may suffer because of its adaptive nature.

To solve the first issue, a neighbourhood with only 1 minority example can simply have that example duplicated Gᵢ times. A second option is to skip generating synthetic data for such neighbourhoods altogether. Lastly, the neighbourhood size K can be increased.
The second issue arises because more data is generated in neighbourhoods with a high proportion of majority class examples. As a result, the synthetic data can end up very similar to the majority class data, potentially producing many false positives. One solution is to cap Gᵢ at a maximum value, so that not too many examples are generated for these neighbourhoods.
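In the running sketch, such a cap is a one-liner (the limit of 20 is an arbitrary illustrative choice):

```python
G_CAP = 20                    # arbitrary per-neighbourhood limit (an assumption)
g = np.minimum(g, G_CAP)      # keep G_i from growing too large anywhere
```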
Conclusion

That wraps up the ADASYN algorithm. The biggest advantages of ADASYN are its adaptive nature, creating more data for "harder-to-learn" examples, and the fact that, unlike under-sampling, it lets you keep all of your negative (majority) data for training. Using ADASYN, you can synthetically balance your data set!
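If you prefer not to implement the steps yourself, the imbalanced-learn package ships an ADASYN implementation; a minimal usage example, reusing the toy X_all and y_all arrays from the sketch above:

```python
# pip install imbalanced-learn
from imblearn.over_sampling import ADASYN

X_res, y_res = ADASYN(n_neighbors=5, random_state=0).fit_resample(X_all, y_all)
print(f"Before: {int(y_all.sum())} positives; after: {int(y_res.sum())} positives")
```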
The full code is available on my GitHub:
Thanks for reading, and let me know if you have any questions in the comments!